Paweł Gawrychowski* and Pat Nicholson**
*University of Warsaw
**Max-Planck-Institut für Informatik
Range Queries in Arrays
• Input: an array 𝐴[1..𝑛]
• Preprocess the array to answer queries of the form
  “Given a range [𝑖, 𝑗], find _____ in the subarray 𝐴[𝑖..𝑗]”
• Where _____ is something like:
  • the index of the maximum/minimum element
  • the indices of the top-𝑘 values
  • the index of the 𝑘-th largest/smallest number
  • the maximum-sum range [𝑖′, 𝑗′] ⊆ [𝑖, 𝑗]
Encoding Range Queries in Arrays
• How much space do we need to answer these queries?
• As an example, think of range minimum queries (RMinQ):
  • If we return the value of the minimum, then we must store the array. Why?
    • Because we can ask the query [𝑖, 𝑖] for each 𝑖 ∈ [1, 𝑛]
    • This allows us to recover the entire array
  • If we return just the array index, then we can do much better:
    there is a succinct data structure that occupies 2𝑛 + 𝑜(𝑛) bits and
    answers queries in constant time. Fischer and Heun (2011)
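The value-vs-index contrast can be made concrete. Below is a minimal index-returning RMinQ sketch using a sparse table; this is emphatically not the 2𝑛 + 𝑜(𝑛)-bit structure of Fischer and Heun, just an illustration of the query interface:

```python
def build_rminq(A):
    """Index-returning range-minimum structure via a sparse table.
    Uses O(n log n) words, far from the 2n + o(n) bits of Fischer and
    Heun (2011); a sketch of the interface only, not their structure."""
    n = len(A)
    table = [list(range(n))]      # table[k][i] = argmin of A[i .. i + 2^k - 1]
    k = 1
    while (1 << k) <= n:
        prev, half = table[k - 1], 1 << (k - 1)
        table.append([
            prev[i] if A[prev[i]] <= A[prev[i + half]] else prev[i + half]
            for i in range(n - (1 << k) + 1)
        ])
        k += 1

    def rminq(i, j):                      # 0-indexed, inclusive range [i, j]
        k = (j - i + 1).bit_length() - 1  # two overlapping 2^k-blocks cover [i, j]
        a, b = table[k][i], table[k][j - (1 << k) + 1]
        return a if A[a] <= A[b] else b

    return rminq
```

Since only indices are returned, querying every [𝑖, 𝑖] reveals nothing about the values, which is exactly what leaves room for sub-array-size encodings.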
Typical Data Structure
• Input Data (Relatively Big) → Preprocess → Data Structure
Encoding Approach
• Input Data (Relatively Big) → Preprocess w.r.t. Some Query → Encoding (Hope: much smaller)
• Auxiliary Data Structures (Should be smaller still)
• Query (Hope: as fast as the non-succinct counterpart)
• Succinct Data Structure: Minimum Space Possible
This Talk: Maximum-Sum Segments
• From Jon Bentley’s “Programming Pearls”:
  • Input: an array 𝐴[1..𝑛] containing arbitrary numbers
  • Output: the range [𝑖, 𝑗] maximizing the sum 𝐴[𝑖] + ⋯ + 𝐴[𝑗]
  • Only non-trivial if the array contains negative numbers
  • Can be solved in linear time (credited to Kadane)
• Applications:
  • Bentley [1986]: “[problem] is a toy – it was never incorporated into a system.”
  • Chen and Chao [2004]: “…plays an important role in sequence analysis.”
• We focus on the range query case:
  • Find the range [𝑖′, 𝑗′] ⊆ [𝑖, 𝑗] maximizing 𝐴[𝑖′] + ⋯ + 𝐴[𝑗′]
  • Also motivated by biological sequence analysis applications
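The linear-time algorithm credited to Kadane can be sketched as follows (a minimal 0-indexed version, not code from the talk; ties are broken toward the leftmost segment):

```python
def max_sum_segment(A):
    """Kadane's algorithm: return (i, j, s) such that the non-empty
    subarray A[i..j] (inclusive, 0-indexed) has maximum sum s."""
    best_i, best_j, best = 0, 0, A[0]
    cur_i, cur = 0, A[0]
    for k in range(1, len(A)):
        # Either extend the current segment or start fresh at k.
        if cur < 0:
            cur_i, cur = k, A[k]
        else:
            cur += A[k]
        if cur > best:
            best_i, best_j, best = cur_i, k, cur
    return best_i, best_j, best
```

The range-query version asks for exactly this answer restricted to 𝐴[𝑖..𝑗], but in constant time after preprocessing.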
Range Maximum-Sum Segment Queries
• What was known:
  • Chen and Chao [ISAAC 2004, Disc. App. Math. 2007]:
    this can be done in Θ(𝑛) words of space and Θ(1) time
• Very closely related to the range maximum problem:
  • RMSSQ → RMaxQ: pad elements with large negative numbers
  • RMinQ/RMaxQ → RMSSQ: a more complicated argument
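The padding reduction can be sketched as follows; the sentinel value and the index mapping are assumptions (the talk does not spell them out), and a brute-force maximum-sum scan stands in for an actual RMSSQ structure:

```python
def rmaxq_via_rmssq(A, i, j):
    """Sketch of answering RMaxQ through maximum-sum segments, for a
    1-indexed query range [i, j]: interleave A with a large negative
    sentinel, so that every maximum-sum segment of the padded array
    within the mapped range is a single original element."""
    big = sum(abs(v) for v in A) + 1
    B = [x for v in A for x in (-big, v)]   # A[k] sits at B[2k-1] (0-indexed B)
    lo, hi = 2 * i - 1, 2 * j - 1           # mapped range B[lo..hi]
    # Brute-force maximum-sum segment in place of a real RMSSQ structure.
    best, seg = None, None
    for a in range(lo, hi + 1):
        s = 0
        for b in range(a, hi + 1):
            s += B[b]
            if best is None or s > best:
                best, seg = s, (a, b)
    a, b = seg
    assert a == b and a % 2 == 1            # a single original element
    return (a + 1) // 2                     # back to a 1-indexed A position
```

Any segment containing a sentinel sums to strictly less than the best single element, so the returned segment pinpoints the maximum.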
Range Maximum-Sum Segment Queries
• What was not known:
  • Is there an efficient encoding structure for this problem?
  • That is: can we beat Θ(𝑛) words?
Range Maximum-Sum Segment Queries
• Our main results:
  I. We can encode these queries using Θ(𝑛) bits (rest of this talk)
  II. A space lower bound of 1.89113𝑛 bits (an enumeration argument using methods from: …)
  III. Application to computing 𝑘-covers (𝑘 disjoint subranges that achieve the maximum sum)
• Csűrös: “The problem arises in DNA and protein segmentation, and in postprocessing of sequence alignments.”
Main Idea: Θ(𝑛)-word solution
• Define an array 𝐶 consisting of the partial sums of 𝐴
• Imagine shooting a ray from each 𝐶[𝑖] to the left
• Now find the minimum in this range
• Define another array 𝑃 storing these minima: 𝑃[𝑖] is the index of the minimum hit by the ray from 𝐶[𝑖]
[Figure: the rays and the resulting minima 𝑃[𝑖] on an example prefix-sum array]
Candidate Pairs
• We call each pair (𝑃[𝑖], 𝑖) a candidate
• We define (yet another) array 𝐷 as follows:
  • 𝐷[𝑖] is the score of the candidate (𝑃[𝑖], 𝑖)
  • That is: the sum within the range [𝑃[𝑖] + 1, 𝑖]
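Assuming (from the figures) that each ray stops at the first prefix-sum value larger than 𝐶[𝑖], and that 𝑃[𝑖] is the position of the minimum within the ray's reach, the three arrays can be built as follows (a quadratic brute force, for illustration only):

```python
def build_candidates(A):
    """Build the talk's arrays for a 1-indexed A of length n.
    C[0..n] holds prefix sums with sentinel C[0] = 0.  For each i >= 1,
    a ray shot left from C[i] stops at the nearest j < i with C[j] > C[i]
    (assumed ray rule); P[i] is the position of the minimum prefix sum in
    the ray's reach, and D[i] = C[i] - C[P[i]] is the candidate's score,
    i.e. the sum of A[P[i]+1 .. i]."""
    n = len(A)
    C = [0] * (n + 1)
    for i in range(1, n + 1):
        C[i] = C[i - 1] + A[i - 1]
    P = [0] * (n + 1)
    D = [0] * (n + 1)
    for i in range(1, n + 1):
        j = i - 1
        while j > 0 and C[j] <= C[i]:   # ray continues over values <= C[i]
            j -= 1
        lo = min(range(j, i), key=lambda t: C[t])  # min prefix sum in [j, i)
        P[i], D[i] = lo, C[i] - C[lo]
    return C, P, D
```

With a left-to-right stack the same arrays are computable in linear time; the scan above just keeps the sketch short.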
What Do They Store?
1) The array 𝐶 (cumulative sums): Θ(𝑛) words
2) The array 𝑃 (candidate partners): Θ(𝑛) words
3) Range min (RMinQ) structure on 𝐶: 2𝑛 + 𝑜(𝑛) bits
4) Range max (RMaxQ) structure on 𝐷 (candidate scores): 2𝑛 + 𝑜(𝑛) bits
Main Idea: Θ(𝑛)-word solution
• How to answer a query: the easy case
  • Let 𝑥 = RMaxQ(𝐷, 𝑖, 𝑗), and examine the candidate pair (𝑃[𝑥], 𝑥)
  • If 𝑃[𝑥] + 1 is in the query range, return [𝑃[𝑥] + 1, 𝑥]
• How to answer a query: the not-so-easy case
  • Let 𝑥 = RMaxQ(𝐷, 𝑖, 𝑗)… this time 𝑃[𝑥] + 1 ∉ [𝑖, 𝑗]
  • Let 𝑑 = RMinQ(𝐶, 𝑖, 𝑥) and 𝑦 = RMaxQ(𝐷, 𝑥 + 1, 𝑗)
  • Return the range with the greater sum: [𝑑 + 1, 𝑥] or [𝑃[𝑦] + 1, 𝑦]
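The two cases can be sketched end to end, with plain scans standing in for the constant-time RMinQ/RMaxQ structures; the exact index convention for RMinQ on 𝐶 and the tie-breaking rules are glossed in the slides and are assumptions here:

```python
def rmssq(C, P, D, i, j):
    """Query procedure from the slides for a 1-indexed range [i, j].
    C, P, D are as in the preceding slides (C has sentinel C[0] = 0).
    Returns (a, b) with segment sum C[b] - C[a-1].  Scans return the
    leftmost optimum; index conventions are assumptions."""
    x = max(range(i, j + 1), key=lambda k: D[k])        # x = RMaxQ(D, i, j)
    if P[x] + 1 >= i:                                   # easy case
        return P[x] + 1, x
    # Not-so-easy case: best segment ending at x starts after the range min.
    d = min(range(i - 1, x), key=lambda k: C[k])        # d = RMinQ(C, i-1, x-1)
    if x == j:
        return d + 1, x
    y = max(range(x + 1, j + 1), key=lambda k: D[k])    # y = RMaxQ(D, x+1, j)
    # Return whichever of [d+1, x] and [P[y]+1, y] has the greater sum.
    if D[y] > C[x] - C[d]:
        return P[y] + 1, y
    return d + 1, x
```

The example arrays in the test below were computed by hand for 𝐴 = [3, −5, 2, 2, −1, 4, −3, 2] under the ray rule assumed earlier.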
Reducing the Space
• What are the bottlenecks in the data structure?
  I. Storing the array 𝑃
    • We need to store the candidate pairs
  II. Storing the array 𝐶
    • We must compare scores of candidates in the not-so-easy case
Dealing with 𝑃: Bottleneck I
Nested Is Good
• Imagine indices as 𝑛 vertices, candidate pairs as edges
• We can represent an 𝑛-edge nested graph in 4𝑛 bits
  • Also known as a one-page or outerplanar graph
• Navigation is efficient: select vertices, follow edges, etc.
  • Jacobson (1989), Munro and Raman (2001); 4𝑛 + 𝑜(𝑛) bits
[Figure: an 8-vertex example; each vertex is labeled with its parenthesis string: 1 → ()((, 2 → ())(, 3 → ()(, 4 → ())), 5 → ())((, 6 → ()(, 7 → ())), 8 → ())]
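A minimal sketch of the balanced-parentheses idea (the per-vertex layout in the talk's example may differ in detail):

```python
def encode_nested_pairs(n, pairs):
    """Balanced-parentheses encoding of a nested (one-page) set of pairs
    over vertices 1..n: every vertex contributes a '()' marker, preceded
    by one ')' per pair closing at it and followed by one '(' per pair
    opening at it.  2n + 2m characters in total, i.e. 4n bits when the
    number of pairs m equals n.  A sketch of the idea only."""
    opens = [0] * (n + 1)
    closes = [0] * (n + 1)
    for u, v in pairs:                      # each pair (u, v) with u < v
        opens[u] += 1
        closes[v] += 1
    return ''.join(')' * closes[v] + '()' + '(' * opens[v]
                   for v in range(1, n + 1))
```

Because the pairs nest, the parentheses match properly, and rank/select machinery on the resulting bit string supports the navigation operations mentioned above.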
Dealing with 𝐶: Bottleneck II
• We call the point ℓ the left sibling of (𝑃[𝑦], 𝑦)
• Knowing ℓ, we can handle the not-so-easy case
[Figure: ℓ, 𝑃[𝑦], and 𝑦 marked on the prefix-sum array]
Recall The Query Algorithm
• Return the range with the greater sum: [𝑑 + 1, 𝑥] or [𝑃[𝑦] + 1, 𝑦]
  • Case: the left sibling of (𝑃[𝑦], 𝑦) is < 𝑑
  • Case: the left sibling of (𝑃[𝑦], 𝑦) is ∈ [𝑑, 𝑥]
  • The left sibling of (𝑃[𝑦], 𝑦) can’t be anywhere else
[Figure: 𝑑, 𝑥, 𝑃[𝑦], and 𝑦 marked on the array in each case]
Dealing with 𝐶: Bottleneck II
• Problem: we cannot store the left siblings explicitly
• Idea: try to find something that is nested
• Solution: the pairs (ℓ, 𝑃[𝑦]) are nested
[Figure: ℓ, 𝑃[𝑦], and 𝑦; the left-sibling pairs nest]
What Do We Store?
1) The graph representing candidates: 4𝑛 + 𝑜(𝑛) bits
2) The graph representing left siblings: 4𝑛 + 𝑜(𝑛) bits
3) Range min (RMinQ) structure on 𝐶: 2𝑛 + 𝑜(𝑛) bits
4) Range max (RMaxQ) structure on 𝐷: 2𝑛 + 𝑜(𝑛) bits
Grand total: 12𝑛 + 𝑜(𝑛) bits… (can be reduced slightly with more tricks)