Download Presentation by Daniel Glasner

Semi-Numerical String Matching Semi-numerical String Matching      All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as: Arithmetic. Bit – operations. The fast Fourier transform. Semi-numerical String Matching  We will survey three examples of such methods:  The Random Fingerprint method due to Karp and Rabin. Shift–And method due to Baeza-Yates and Gonnet, and its extension to agrep due to Wu and Manber. A solution to the match count problem using the fast Fourier transform due to Fischer and Paterson and an improvement due to Abrahamson.   Karp-Rabin fingerprint - exact match    Exact match problem: we want to find all the occurrences of the pattern P in the text T. The pattern P is of length n. The text T is of length m. Karp-Rabin fingerprint - exact match  Arithmetic replaces comparisons.  An efficient randomized algorithm that makes an error with small probability. A randomized algorithm that never errors whose expected running time is efficient.   We will consider a binary alphabet: {0,1}. Arithmetic replaces comparisons.   Strings are also numbers, H: strings → numbers. Let s be a string of length n, H ( s)  i 1 2 n  ni s (i ) Definition: let Tr denote the n length substring of T starting at position r. Arithmetic replaces comparisons.  Strings are also numbers, H: strings → numbers. T=10110101 P=0101 T=10110101 P= 0101 H(T5) = 5 = H(P) = 5 T=10110101 P= 0101 H(T2) = 6 ≠ H(P) = 5 Arithmetic replaces comparisons.  Theorem: There is an occurrence of P starting at position r of T if and only if H(P) = H(Tr)  Proof: Follows immediately from the unique representation of a number in base 2. Arithmetic replaces comparisons.  We can compute H(Tr) from H(Tr-1) H (Tr )  2 H (Tr 1 )  2n T (r  1)  T (r  n  1) T=10110101 T1 = 1 0 1 1 T2 = 0 1 1 0 H (T1 )  H (1011)  11 H (T2 )  2 11  24 1  0  22  16  6  H (0110) Arithmetic replaces comparisons.  A simple efficient algorithm:  Compute H(T1). Run over T Compute H(Tr) from H(Tr-1) in constant time, and make the comparisons.   Total running time O(m)? Karp-Rabin   Let’s use modular arithmetic, this will help us keep the numbers small. For some integer p The fingerprint of P is defined by Hp(P) = H(P) (mod p) Karp-Rabin  Lemma: H p ( P )  {[...({[ P (1)  2(mod p)  P (2)]  2(mod p)  P (3)}  2(mod p)  P (4))...](mod p)  P (n)}(mod p)  H ( P )(mod p) And during this computation no number ever exceeds 2p. An example P=101111 p=7 H(P) = 47 Hp(P) = 47 (mod 7) = 5 1 2(mod 7)  0  2 2  2(mod 7)  1  5 5  2(mod 7)  1  4 4  2(mod 7)  1  2 2  2(mod 7)  1  5 5(mod 7)  5  H p ( P ) Karp-Rabin   Intermediate numbers are also kept small. We can still compute H(Tr) from H(Tr-1). Arithmetic: H (Tr )  2 H (Tr 1 )  2n T (r  1)  T (r  n  1) Modular arithmetic: H p (Tr )  [( 2 H (Tr 1 )(mod p))  (2n (mod p))T (r  1)  T (r  n  1)](mod p) Karp-Rabin   Intermediate numbers are also kept small. We can still compute H(Tr) from H(Tr-1). Arithmetic: 2n  2(2n1 ) Modular arithmetic: 2n (mod p)  2(2n1 (mod p))(mod p) Karp-Rabin  How about the comparisons? Arithmetic: There is an occurrence of P starting at position r of T if and only if H(P) = H(Tr) Modular arithmetic: If there is an occurrence of P starting at position r of T then Hp(P) = Hp(Tr) There are values of p for which the converse is not true! Karp-Rabin  Definition: If Hp(P) = Hp(Tr) but P doesn’t occur in T starting at position r, we say there is a false match between P and T at position r. If there is some position r such that there is a false match between P and T at position r, we say there is a false match between P and T. Karp-Rabin  Our goal will be to choose a modulus p such that  p is small enough to keep computations efficient. p is large enough so that the probability of a false match is kept small.  Prime moduli limit false matches  Definition: For a positive integer u, п(u) is the number of primes that are less than or equal to u.  Prime number theorem (without proof): u u   (u)  1.26 ln( u) ln( u) Prime moduli limit false matches  Lemma (without proof): if u ≥ 29, then the product of all the primes that are less than or equal to u is greater than 2u.  Example: u = 29, the prime numbers less than or equal to 29 are: 2,3,5,7,11,13,17,19,23,29, their product is 6,469,693,230 ≥ 536,870,912 = 229 Prime moduli limit false matches  Corollary: If u ≥ 29 and x is any number less than or equal to 2u, then x has fewer than п(u) distinct prime divisors.  Proof: Assume x has k ≥ п(u) distinct prime divisors q1 , …, qk then 2u ≥ x ≥ q1* …* qk but q1* …* qk is at least as large as the product of the first п(u) prime numbers. Prime moduli limit false matches  Theorem: Let I be a positive integer, and p a randomly chosen prime less than or equal to I. If nm ≥ 29 then The probability of a false match between P and T is less than or equal to п(nm) / п(I) . Prime moduli limit false matches       Proof: Let R be the set of positions in T where P doesn’t begin. nm | H ( P )  H ( T ) |  2  s We have sR By the corollary the product has at most п(nm) distinct prime divisors. If there is a false match at position r then p divides | H ( P )  H (Tr ) | thus also divides sR | H ( P )  H (Ts ) | p must be in a set of size п(nm) but p was chosen randomly out of a set of size п(I). Random fingerprint algorithm  Choose a positive integer I. Pick a random prime p less than or equal to I, and compute P’s fingerprint – Hp(P). For each position r in T, comput Hp(Tr) and test to see if it equals Hp(P). If the numbers are equal either declare a probable match or check and declare a definite match.  Running time: excluding verification O(m).   How to choose I    The smaller I is, computations are more efficient The larger I is, the probability of a false match decresses. Proposition: When I = nm2 1. The largest number used in the algorithm requires at most 4(log(n)+log(m)) bits. 2. The probability of a false match is at most 2.53/m. How to choose I  Proof:  (nm ) nm ln( nm 2 )  1.26  2 2  (nm ) nm ln( nm ) 1  ln( n)  2 ln( m )  2.53   1.26  m  ln( n)  ln( m )  m Extensions  An idea: why not choose k primes?  Proposition: when k primes are chosen randomly and independently between 1 and I, the probability of a false match is at most   (nm )  k      (I )   Proof: We saw that if p allows and error it is in a set of at most п(nm) integers. A false match can occur only if each of the independently chosen k primes is in a set of size of at most п(nm) integers. An illustaration  k = 4, n = 250, m = 4000 I = 250*40002 < 232 k   (nm )   2.53  12       10   ( I )   4000  k Even lower limits on the error  When k primes are used, the probability of a false match is at most   (n)  k     (I )   Proof: Suppose a false match occurs at position r. That means that each of the primes must divide |H(P)-H(Tr) | ≤ 2n. There are at most п(n) primes that divide it. Each prime is chosen from a set of size п(I) and by chance is a part of a set of size п(n). Checking for error in linear time    Consider the list L of locations in T where the KarpRabin algorithm declares P to be found. A run is a maximal interval of starting locations l1, l2, …, lr in L such that every two numbers differ by at most n/2. Let’s verify a run. Checking for error in linear time  Check the first two declared occurrences explicitly. P = abbabbabbabbab T = abbabbabbabbabbabbabbabbabbax… P= abbabbabbabbab T = abbabbabbabbabbabbabbabbabbax…   If there is a false match stop. Otherwise P is semi periodic with period d = l1 – l2. Checking for error in linear time  d is the minimal period.  P = abbabbabbabbab T = abbabbabbabbabbabbabbabbabbax… P= abbabbabbabbab T = abbabbabbabbabbabbabbabbabbax… Checking for error in linear time P = abbabbabbabbab T = abbabbabbabbabbabbabbabbabbax…   For each i check that li+1 – li = d. Check the last d characters of li for each i. Checking for error in linear time P = abbabbabbabbab T = abbabbabbabbabbabbabbabbabbax…  Check l1 Checking for error in linear time P= abbabbabbabbab T = abbabbabbabbabbabbabbabbabbax…  Check l2  P is semi periodic with period 3. Checking for error in linear time T = abbabbabbabbabbabbabbabbabbax…  Check li+1 – li = 3 Checking for error in linear time P= bab T = abbabbabbabbabbabbabbabbabbax…  For each i check the last 3 characters of li. Checking for error in linear time P= bab T = abbabbabbabbabbabbabbabbabbax…  For each i check the last 3 characters of li. Checking for error in linear time P= bab T = abbabbabbabbabbabbabbabbabbax…  For each i check the last 3 characters of li.  Report a false match or approve the run. Time analysis    No character of T is examined more than twice during a single run. Two runs are separated by at least n/2 positions and each run is at least n positions long. Thus no character of T is examined in more than two consecutive runs. Total verification time O(m). Time analysis    When we have a false match we start again with a different prime. The expected probability of a false match is O(1/m). We have converted the algorithm to one that never mistakes with expected linear running time. Why use Karp-Rabin?     It is efficient and simple. It is space efficient. It can be generalized to solve harder problems such as 2-dimensional string matching. It’s performance is backed up by a concrete theoretical analysis. The Shift-And Method The Shift-And Method  We start with the exact match problem.  Define M to be a binary n by m matrix such that: M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j. M(i,j) = 1 iff P[1 .. i] ≡ T[j-i+1 .. j] The Shift-And Method   Let T = california Let P = for M=   1 2 3 4 5 6 7 8 9 m = 10 1 0 0 0 0 1 0 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 n=3 0 0 0 0 0 0 1 0 0 0 M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j. How does M solve the exact match problem? How to construct M     We will construct M column by column. Two definitions are in order: Bit-Shift(j-1) is the vector derived by shifting the vector for column j-1 down by one and setting the first bit to 1. Example:  0  1      1   0  BitShift(1 )  1       0  1  1   0      How to construct M   We define the n-length binary vector U(x) for each character x in the alphabet. U(x) is set to 1 for the positions in P where character x appears. Example: 1    0 1  U (a )  P = abaac   1  0   0   1  U (b)   0    0 0   0   0 U (c )   0    0 1    How to construct M   Initialize column 0 of M to all zeros For j > 1 column j is obtained by M ( j )  BitShift( j  1)  U (T ( j )) An example j = 1 1 2 3 4 5 6 7 8 9 10 T=xabxabaaxa 12345 P=abaac  0    0 U ( x)   0     0  0   1 1 0 2 0 3 0 4 0 5 0 2 3 4 5 6 7 8 9 1   0   0         0  0  0 BitShift(0) & U (T (1))   0  &  0    0         0  0  0  0  0  0       1 0 An example j = 2 1 2 3 4 5 6 7 8 9 10 T=xabxabaaxa 12345 P=abaac 1    0 U (a )  1    1  0   1 2 1 0 1 2 0 0 3 0 0 4 0 0 5 0 0 3 4 5 6 7 8 9  1   1  1         0  0  0 BitShift(1) & U (T (2))   0  & 1    0         0  1   0   0  0  0       1 0 An example j = 3 1 2 3 4 5 6 7 8 9 10 T=xabxabaaxa 12345 P=abaac 0   1  U (b)   0    0 0   1 2 3 1 0 1 0 2 0 0 1 3 0 0 0 4 0 0 0 5 0 0 0 4 5 6 7 8 9 1   0   0        1  1  1  BitShift(2) & U (T (3))   0  &  0    0         0  0  0  0  0  0       1 0 An example j = 8 1 2 3 4 5 6 7 8 9 10 T=xabxabaaxa 12345 P=abaac 1    0 U (a )  1    1  0   1 2 3 4 5 6 7 8 1 0 1 0 0 1 0 1 1 2 0 0 1 0 0 1 0 0 3 0 0 0 0 0 0 1 0 4 0 0 0 0 0 0 0 1 5 0 0 0 0 0 0 0 0 9 1  1  1        1   0   0  BitShift(7) & U (T (8))   0  & 1    0        1  1  1   0  0  0       1 0 Correctness  1) 2) For i > 1, Entry M(i,j) = 1 iff The first i-1 characters of P match the i-1characters of T ending at character j-1. Character P(i) ≡ T(j).  1) is true when M(i-1,j-1) = 1. 2) is true when the i’th bit of U(T(j)) = 1.  The algorithm computes the and of these two bits.  Correctness 1 2 3 4 5 6 7 8 9 10 T=xabxabaaxa abaac    1 2 3 4 5 6 7 8 9 1 0 1 0 1 0 0 1 0 1 1 0 1 2 0 0 1 0 0 1 0 0 0 0 3 0 0 0 0 0 0 1 0 0 0 4 0 0 0 0 0 0 0 1 0 0 5 0 0 0 0 0 0 0 0 0 0 M(4,8) = 1, this is because a b a a is a prefix of P of length 4 that ends at position 8 in T. Condition 1) – We had a b a as a prefix of length 3 that ended at position 7 in T ↔ M(3,7) = 1. Condition 2) – The fourth bit of P is the eighth bit of T ↔ The fourth bit of U(T(8)) = 1. How much did we pay?    Formally the running time is Θ(mn). However, the method is very efficient if n is the size of a single or a few computer words. Furthermore only two columns of M are needed at any given time. Hence, the space used by the algorithm is O(n). agrep: The Shift-And Method with errors  We extend the shift-and method for finding inexact occurrences of a pattern in a text.  Reminder example: T = aatatccacaa P = atcgaa P appears in T with 2 mismatches starting at position 4, it also occurs with 4 mismatches starting at position 2. aatatccacaa aatatccacaa atcgaa atcgaa agrep  Our current goal given k find all the occurrences of P in T with up to k mismatches.  We define the matrix Mk to be an n by m binary matrix, such that: Mk (i,j) = 1 iff At least i-k of the first i characters of P match the i characters up through character j of T.   What is M0? How does Mk solve the k-mismatch problem? Computing Mk     We compute Ml for all l=0, … , k. For each j compute M(j), M1(j), … , Mk(j) For all l initialize Ml(0) to the zero vector. The j’th column of Ml is given by: Computing Mk  The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal. j-1 * * * * * * * * * * i-1  BitShift( M ( j  1))  U (T ( j )) l Computing Mk  The first i-1 characters of P match a substring of T ending at j-1, with at most l -1 mismatches. j-1 * * * * * * * * * * i-1  BitShift( M l 1 ( j  1)) Computing Mk     We compute Ml for all l=1, … , k. For each j compute M(j), M1(j), … , Mk(j) For all l initialize Ml(0) to the zero vector. The j’th column of Ml is given by: M l ( j)  [ BitShift( M l ( j  1))  U (T ( j ))]  BitShift( M l 1 ( j  1)) Example: M1 1 2 3 4 5 6 7 8 9 10 T=xabxabaaxa P= abaac 1 2 3 4 5 6 7 8 9 1 0 1 1 1 1 1 1 1 1 1 1 1 2 0 0 1 0 0 1 0 1 1 0 3 0 0 0 1 0 0 1 0 0 1 4 0 0 0 0 1 0 0 1 0 0 5 0 0 0 0 0 0 0 0 1 0 1 2 3 4 5 6 7 8 9 1 0 1 0 1 0 0 1 0 1 1 0 1 2 0 0 1 0 0 1 0 0 0 0 3 0 0 0 0 0 0 1 0 0 0 4 0 0 0 0 0 0 0 1 0 0 5 0 0 0 0 0 0 0 0 0 0 M 0= Example: M1 1 2 3 4 5 6 7 8 9 10 T=xabxabaaxa P= abaa 1 2 3 4 5 6 7 8 9 1 0 1 1 1 1 1 1 1 1 1 1 1 2 0 0 1 0 0 1 0 1 1 0 3 0 0 0 1 0 0 1 0 0 1 4 0 0 0 0 1 0 0 1 0 0 5 0 0 0 0 0 0 0 0 1 0 1    0 U (a )  1    1  0   How much did we pay?    Formally the running time is Θ(kmn). Again, the method is practically efficient for small n. Still only a constant number of columns of M are needed at any given time. Hence, the space used by the algorithm is O(n). The match count problem The match-count problem  We want to count the exact number of characters that match each of the different alignments of P with T. aatatccacaa atcgaa 4 aatatccacaa atcgaa 2 The match-count problem    We will first look at a simple algorithm which extends the techniques we’ve seen so far. Next, we introduce a more efficient algorithm that exploits existing efficient methods to calculate the Fourier transform. We conclude with a variation that gives good performance for unbounded alphabets. Match-count Algorithm 1  We define the matrix MC to be an n by m integer valued matrix, such that: MC(i,j) = The number of characters of P[1..i] that match T[jI+1,..,j]  How does MC solve the match-count problem? Computing MC   Initialize column 0 of MC to all zeros For j ≥ 1 column j is obtained by  MC (i  1, j  1) MC (i , j )    MC (i  1, j  1)  1  P (i )  T ( j ) Otherwise Total of Θ(nm) comparisons and (simple) additions. Match-count algorithm 2  Define a vector W that counts the matching symbols, it’s indices are the possible alignments. T=ababcaaa P=abca W(1) = 2 ababcaaa abca W(3) = 4 ababcaaa abca W(5) = 1 ababcaaa abca W(2) = 0 ababcaaa abca W(4) = 1 Match-count algorithm 2  Let’s handle one symbol at a time: T=ababcaaa P=abca Ta = 1 0 1 0 0 1 1 1 Pa = 1 0 0 1 Wa(1) = 1 10100111 1001 Wa(3) = 2 Match-count algorithm 2  We have W = Wa + Wb + Wc.  Or in the general case  W  W   Match-count algorithm 2  We can calculate Wα using a convolution.   Let’s rephrase the problem. X = Tα padded with n zeros on the right. Y = Pα padded with m zeros on the right.  We have two vectors X,Y of length m+n.  Match-count algorithm 2  Ta = 1 0 1 0 0 1 1 1 Pa = 1 0 0 1 X=101001110000 Y=100100000000 Match-count algorithm 2  In our modified representation: W (i )  n  m 1  X ( j )  Y (i  j ) j 0  Where the indices are taken modulo n+m. W(1) = W(2) = < 1 0 1 0 0 1 1 1 0 0 0 0, 100100000000 > <010100111000, 010010000000 > Match-count algorithm 2  In our modified representation: W (i )  n  m 1  X ( j )  Y (i  j ) j 0  Where the indices are taken modulo n+m.  This is the convolution of X and the reverse of Y.  Using FFT calculating convolution takes time O(m log(m)). Match-count algorithm 2  The total running time is O(|∑| m log(m))  What happens if |∑| is large?  For example when |∑| =n, we get O(n m log(m)) which is actually worse than the naïve algorithm. Match-count algorithm 3    An idea: some symbols might appear more often than others. Use convolutions for the frequent symbols. Use a more simple counting method for the rest. Rare symbols    Say α appears less than c times in P. Record the locations of α in P l1,…,lr r ≤c. Go over the text, when we see α at location j we increment W(j-l1+1) , … , W(j-lr+1+1). Rare symbols T = a b a b c a a a… P=abcac l1 = 3, l2 = 5 j = 5 → W(5-3+1)++ W(3)++ a b a b c a a a… abcac W(5-5+1)++ W(1)++ T = a b a b c a a a... abcac How much did we pay?    We can do this for all the rare symbols in one sweep of T, for each position in T we make up to c updates in W. Thus handling the rare symbols will cost us O(cm). For the frequent symbols we pay one convolution per symbol so we pay at most O(n/c m log(m)). Determining c  We choose the c that gives us the best balance n cm  m  log( m )  c c 2  n  log( m )  c  n  log( m )  The total running time is O(m n  log( m) ) References  Dan Gusfield, Algorithms on Strings, Trees and Graphs. Cambridge Univ. Press, Cambridge,1997. The end

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Presentation by Daniel Glasner