Deterministic Length Reduction: Fast Convolution in Sparse Data and Applications
Written by: Amihood Amir, Oren Kapah and Ely Porat

Motivation – Point Set Matching
• Integer 1-D Point Set Matching:
  • T: (t_1, t_2, …, t_n)
  • P: (p_1, p_2, …, p_m)
  • Where t_i and p_i are integers.
  • Let N = t_n, M = p_m (the maximal index).
  • Time: O(nm), O(N·log(M))

Motivation – Point Set Matching
• 2-D Point Set Matching – Searching in Music:
  • T: (i_1, j_1), (i_2, j_2), …, (i_n, j_n)
  • P: (i_1, j_1), (i_2, j_2), …, (i_m, j_m)
• Dimension Reduction: (i, j) → i·N + j
[Figure: Pattern, Text]

Motivation – Generalized Case
• The generalized case of these problems is the d-dimensional sparse wildcard matching problem.
• Problem Definition: Given a d-dimensional text T with zeros and non-zeros, and a d-dimensional pattern P with wildcards and non-zeros, find all the locations where P matches T.
• Applications: d-dimensional point set matching, searching in music, protein activity research, etc.

Length Reduction
• Goal: Given two vectors V1 and V2, obtain two vectors V'1 and V'2 of size O(n_1) such that all non-zeros in V1 and in V2 appear as singletons in V'1 and V'2 respectively, while maintaining the distance property.
• The Distance Property: If V'2[f(0)] is aligned with V'1[f(i)], then V'2[f(j)] will be aligned with V'1[f(i + j)].
• Using the reduced size vectors, matching can be done in time O(n_1 log(n_1)) using convolutions.

Example: Length Reduction
The vectors are given as sets of pairs: (index, value).
V1: (0, 5), (6, 2), (13, 3), (19, 1)
V2: (0, 2), (7, 3)
Length Reduction Function: mod(5)
V'1: 5 2 0 3 1
V'2: 2 0 3 0 0

The Randomized Algorithm (Cole & Hariharan – STOC02)
• Idea: Find a set of log(n) short vectors in which, with high probability, each non-zero in V appears as a singleton in at least one of the vectors.
• Hash functions: (ax mod(q)) mod(s), where q is a large prime number and s is O(n).
• If s is c·n, then the probability of a non-zero appearing as a multiple is constant.
• Using log(n) different hash functions will reduce the failure probability exponentially.

The Randomized Algorithm – Sources of Errors
1. Some non-zeros may appear only as multiples in all of the vectors in the set.
2. The non-zero from the text which was aligned with the non-zero from the pattern came from a different index (false matches).
3. This algorithm was created for matching, but in convolution each non-zero should be calculated only once.

Deterministic Length Reduction
• Our Goal: Find a set of log(n) hash functions which ensures that each non-zero appears as a singleton at least once.
• Finding the hash functions is done in a preprocessing step based on V1.
• The algorithm distinguishes between 2 cases:
  • N_1 is polynomial in n_1.
  • N_1 is exponential in n_1.

The Polynomial Case: N < n^c
• Let q be a prime number of size O(n), and mod(q) be the suggested hash function.
• Let i, j be the indices of two non-zeros.
• Observation: If i and j are mapped into the same location, it means that q divides the distance d_ij.
• Observation: There are at most c prime numbers of size O(n) which divide d_ij.
• Corollary: A non-zero can appear as a multiple in at most c·n prime numbers.

Choosing Prime Numbers
• Test 2c·n prime numbers (of size O(n log n)), and build the following table:
  • Each column represents a non-zero (n columns).
  • Each row represents a prime number (2c·n rows).
• Reminder: Each non-zero can appear as a multiple at most c·n times.
• Corollary: The table is at least half full with ones.

      NZ1  NZ2  NZ3  NZ4  NZ5
P1     1    0    0    1    0
P2     1    1    0    1    0
P3     0    0    0    0    1
P4     1    0    1    0    1
P5     0    1    1    1    0
P6     0    1    0    0    1
P7     1    0    1    1    0
P8     0    0    1    1    1
P9     1    1    0    0    0
P10    0    1    1    0    1
P11    0    1    0    1    1
P12    1    0    1    0    0

Choosing Prime Numbers: Cont.
1. Select a prime number which generates a row that is at least half full (for example P2).
2. Delete the row and all the columns in which there was a 1 in the deleted row.
3. Repeat steps 1 and 2 until the whole table is deleted.
Selected Primes: P2, P4
Time: O(n^2)

The Exponential Case: N < 2^n
• Idea: Reduce the length of the vector to polynomial and continue with the previous algorithm.
• Any distance d_ij can be divided by at most n prime numbers.
• There are at most n^2 different distances.
• Corollary: There are at most n^3 prime numbers which generate multiples.

The Reduction Algorithm
1. Choose a prime number q of size O(n^4).
2. Create the reduced size vector using the mod(q) hash function.
3. Repeat steps 1 and 2 if a multiple was created.
4. Duplicate the obtained vector (create a vector of size 2q), to allow further reduction of the vector.
Time: O(n^4)

The Randomized Algorithm – Sources of Errors
1. Some non-zeros may appear only as multiples in all of the vectors in the set.
2. The non-zero from the text which was aligned with the non-zero from the pattern came from a different index (false matches).
3. This algorithm was created for matching, but in convolution each non-zero should be calculated only once.

The Convolution Algorithm
1. For each prime number P_i:
   1. Create the reduced size vectors V'1,i and V'2,i using the indices of the non-zeros and perform shift matching.
   2. Create the reduced size vectors V'1,i and V'2,i using 1's instead of the non-zeros and perform convolution.
   3. Create the reduced size vectors V'1,i and V'2,i using the values of the non-zeros and perform convolution.
   4. Zero the value of the non-zeros that appeared as singletons.
2. For all indices where shift matching was found:
   1. Sum the results of the 1's convolutions.
   2. If the result is n_2, then sum the results of the values convolutions and report the result.
Time: O(n log^3(n))

Example
V1: (0, 5), (5, 2), (13, 3), (20, 1)
V2: (0, 2), (8, 3)
Prime Numbers: 5, 7
[Figure: reduced vectors V'1,1 and V'2,1 for prime 5, and V'1,2 and V'2,2 for prime 7]
Results for prime 5: (5, 1, 9), (13, 1, 6)
Results for prime 7: (0, 1, 10), (5, 1, 4)

Conclusions and Open Problems
• A deterministic algorithm for length reduction and fast convolution was presented.
  • Preprocessing time: O(n^2) – Polynomial case, O(n^4) – Exponential case.
  • Running time: O(n log^2 n)
• Open problems:
  • Can the preprocessing time be reduced?
  • Can the size of the vectors be reduced?
  • Can the number of vectors be reduced?

Thank You! Questions?
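
The mod(q) length reduction and the row-selection steps from the "Choosing Prime Numbers" slides can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' implementation. The function names (reduce_mod, singletons, choose_primes) are hypothetical, a table entry of 1 is assumed to mean "this non-zero is a singleton under this prime", and the sketch simply picks the fullest remaining row instead of any row that is at least half full.

```python
from collections import Counter


def reduce_mod(pairs, q):
    """Map a sparse vector given as (index, value) pairs to a dense vector of length q."""
    reduced = [0] * q
    for index, value in pairs:
        # Non-zeros that land in the same slot are summed (they form a "multiple").
        reduced[index % q] += value
    return reduced


def singletons(indices, q):
    """Return the indices whose slot (index mod q) is shared with no other index."""
    load = Counter(i % q for i in indices)
    return {i for i in indices if load[i % q] == 1}


def choose_primes(indices, primes):
    """Greedily pick primes until every non-zero is a singleton under some chosen prime.

    Assumption: `primes` is a candidate list supplied by the caller.
    Simplification: the fullest row is taken each round, not just any half-full row.
    """
    remaining = set(indices)
    chosen = []
    while remaining:
        best = max(primes, key=lambda q: len(singletons(indices, q) & remaining))
        covered = singletons(indices, best) & remaining
        if not covered:
            raise ValueError("candidate primes cannot isolate every non-zero")
        chosen.append(best)
        remaining -= covered  # delete the columns that had a 1 in the chosen row
    return chosen
```

With the vectors from the "Example: Length Reduction" slide, reduce_mod([(0, 5), (6, 2), (13, 3), (19, 1)], 5) reproduces V'1 and reduce_mod([(0, 2), (7, 3)], 5) reproduces V'2. Recomputing the singleton sets on every round keeps the sketch short; building the table once, as the slides describe, avoids that repeated work.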