Succinct Representations of Dynamic Strings
Meng He and J. Ian Munro
University of Waterloo
Background: Succinct Data Structures

- What are succinct data structures (Jacobson 1989)?
  - Representing data structures using ideally information-theoretic minimum space
  - Supporting efficient navigational operations
- Why succinct data structures?
  - Large data sets in modern applications: textual, genomic, spatial or geometric
Strings: Definitions

- Notation
  - Alphabet: [σ] = {1, 2, …, σ}
  - String: S[1..n]
- Operations
  - access(i): S[i]
  - rank(α, i): number of occurrences of α in S[1..i]
  - select(α, i): position of the i-th occurrence of α in S
Strings: An Example
S=aabacccdaddabbbc
string_access(8) = d
string_rank(a, 8) = 3
string_select(b, 3) = 14
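
The three queries can be checked against the example string with a naive reference version. This is only a sketch for reproducing the values above: it uses linear scans over a plain Python string and is not succinct. The function names and the convention of returning 0 for a missing occurrence are assumptions made for illustration.

```python
# Naive reference versions of the three queries (1-based positions, as in the slides).
S = "aabacccdaddabbbc"

def access(s: str, i: int) -> str:
    """Return S[i]."""
    return s[i - 1]

def rank(s: str, a: str, i: int) -> int:
    """Count the occurrences of a in S[1..i]."""
    return s[:i].count(a)

def select(s: str, a: str, i: int) -> int:
    """Return the position of the i-th occurrence of a, or 0 if there is none."""
    seen = 0
    for pos, ch in enumerate(s, start=1):
        if ch == a:
            seen += 1
            if seen == i:
                return pos
    return 0  # assumed convention for "no such occurrence"

print(access(S, 8), rank(S, "a", 8), select(S, "b", 3))  # d 3 14
```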
Succinct Representations of Strings

- Information-theoretic minimum: n lg σ bits
- Succinct representation (Grossi et al. 2003)
  - Space: n H0 + o(n)∙lg σ bits
  - Time: O(lg σ)
  - There are many more results.
- The case in which σ = 2 (bit vector) is even more fundamental!
  - Jacobson 1989
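
To make the two space terms concrete, the sketch below computes n lg σ and n H0 for the example string, taking H0 to be the zeroth-order empirical entropy of S. The helper name `h0` and the choice of σ as the number of distinct characters in S are assumptions for illustration.

```python
import math
from collections import Counter

def h0(s: str) -> float:
    """Zeroth-order empirical entropy of s, in bits per character."""
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())

S = "aabacccdaddabbbc"
sigma = len(set(S))                      # assumed: alphabet = characters used by S
plain_bits = len(S) * math.log2(sigma)   # n lg σ
entropy_bits = len(S) * h0(S)            # n H0, never larger than n lg σ
print(plain_bits, entropy_bits)
```

The two values are close here because the four characters occur almost equally often; a skewed distribution would push n H0 well below n lg σ.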
Applications of Strings and Bit Vectors

- Ordinal trees on n nodes
  - Standard approach: 3n lg n bits
  - Succinct data structures: 2n + o(n) bits (Jacobson 1989, Munro & Raman 1997, Benoit et al. 1999…)
- Full-text indexes for a text string from [σ]^n
  - Suffix trees can use as much as 4n lg n to 6n lg n bits!
  - Succinct data structures: n lg σ + o(n lg σ) bits (Grossi et al. 2003, González and Navarro 2009…)
- Labeled trees, planar graphs, binary relations, permutations, functions, …
Our Problem: Dynamic Strings

- Motivation: in many applications, data are also updated frequently
- For strings, we also consider the following update operations:
  - insert(α, i), which inserts character α between S[i-1] and S[i]
  - delete(i), which deletes S[i] from S
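
The semantics of the two updates can be pinned down with a list-backed model. This is a sketch of the interface only, with O(n) time per update and no compression; the class name `NaiveDynamicString` is hypothetical.

```python
class NaiveDynamicString:
    """List-backed reference model of a dynamic string (1-based positions)."""

    def __init__(self, s: str = "") -> None:
        self.chars = list(s)

    def insert(self, a: str, i: int) -> None:
        """Insert a between S[i-1] and S[i], so that a becomes the new S[i]."""
        self.chars.insert(i - 1, a)

    def delete(self, i: int) -> None:
        """Delete S[i] from S."""
        del self.chars[i - 1]

    def rank(self, a: str, i: int) -> int:
        """Count the occurrences of a in S[1..i]."""
        return self.chars[:i].count(a)

    def __str__(self) -> str:
        return "".join(self.chars)

s = NaiveDynamicString("abc")
s.insert("d", 2)   # S is now "adbc"
s.delete(1)        # S is now "dbc"
print(s)           # dbc
```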
Comparisons
                          | Space (bits)               | Access, rank and select                 | Insert and delete
Gupta et al. 2007         | n lg σ + lg σ∙(o(n)+O(1))  | O(lg lg n)                              | O(n^ε) amortized
Mäkinen & Navarro 2008    | n H0 + o(n)∙lg σ           | O(lg n lg σ)                            | O(lg n lg σ)
Lee & Park 2009           | n lg σ + o(n)∙lg σ         | O(lg n (lg σ / lg lg n + 1))            | O(lg n (lg σ / lg lg n + 1)) amortized
González and Navarro 2009 | n H0 + o(n)∙lg σ           | O(lg n (lg σ / lg lg n + 1))            | O(lg n (lg σ / lg lg n + 1))
This paper                | n H0 + o(n)∙lg σ           | O((lg n / lg lg n)(lg σ / lg lg n + 1)) | O((lg n / lg lg n)(lg σ / lg lg n + 1))

For the special cases in which σ = polylog(n) or σ = 2 (bit vector!), our results also improve previous results.
Searchable Partial Sums

- Data
  - A sequence Q of n nonnegative integers
- Operations
  - sum(i): Q[1] + Q[2] + … + Q[i]
  - search(x): the smallest i such that sum(i) ≥ x
  - update(i, δ): Q[i] ← Q[i] + δ
- Raman et al. 2001
  - Assumptions: |Q| = O(lg^ε n), |δ| ≤ lg n
  - Space: O(lg^(1+ε) n) bits, with a universal table of size O(n^ε') bits
  - Operations: O(1) time
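
The three operations can be tried out with a Fenwick (binary indexed) tree. This is a swapped-in textbook technique with O(lg n) time per operation, not the constant-time, table-based structure of Raman et al.; the class name `PartialSums` and the sample values are assumptions. The search relies on all entries staying nonnegative.

```python
class PartialSums:
    """Fenwick tree over nonnegative integers, 1-based positions."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)
        for i, v in enumerate(values, start=1):
            self.update(i, v)

    def update(self, i: int, delta: int) -> None:
        """Q[i] <- Q[i] + delta (entries are assumed to stay nonnegative)."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def sum(self, i: int) -> int:
        """Q[1] + Q[2] + ... + Q[i]."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def search(self, x: int) -> int:
        """Smallest i with sum(i) >= x, or n + 1 if the total is below x."""
        pos = 0
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            # Descend while the prefix ending at nxt still sums to less than x.
            if nxt <= self.n and self.tree[nxt] < x:
                pos = nxt
                x -= self.tree[nxt]
            step >>= 1
        return pos + 1

q = PartialSums([8, 2, 9, 5, 11])
print(q.sum(3), q.search(11))   # 19 3
q.update(2, 4)                  # Q becomes [8, 6, 9, 5, 11]
print(q.sum(3), q.search(11))   # 23 2
```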
Collections of Searchable Partial Sums

- Data
  - d sequences of k-bit nonnegative integers of length n each
- Operations
  - sum, search, update: supported on each sequence
  - insert, delete: operated simultaneously on the same positions of all the sequences, but only 0's can be inserted or deleted
- González and Navarro 2009 (CSPSI)

Example (figure values read as three sequences of length 11):
  Sequence 1:   8   2   9   5  11   0   5  12   0   3   1
  Sequence 2:   9   0   7   3   6   0  19   0   4   2   8
  Sequence 3:   1   5   3  12   4   0   3   5   4   1   0
  sum(2, 5) = 25
  insert(6)
  delete(6)
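
The CSPSI interface can be modelled with plain lists. The sketch below takes the 33 example values as three sequences of length 11, an assumed reading that agrees with sum(2, 5) = 25 and puts a 0 at position 6 of every sequence. The class name `NaiveCSPSI` is hypothetical and every operation here takes linear time.

```python
class NaiveCSPSI:
    """d parallel integer sequences stored as plain lists (1-based arguments)."""

    def __init__(self, rows):
        assert len({len(r) for r in rows}) <= 1, "sequences must have equal length"
        self.rows = [list(r) for r in rows]

    def sum(self, j: int, i: int) -> int:
        """Sum of the first i entries of sequence j."""
        return sum(self.rows[j - 1][:i])

    def search(self, j: int, x: int) -> int:
        """Smallest i with sum(j, i) >= x, or length + 1 if no prefix reaches x."""
        total = 0
        for i, v in enumerate(self.rows[j - 1], start=1):
            total += v
            if total >= x:
                return i
        return len(self.rows[j - 1]) + 1

    def update(self, j: int, i: int, delta: int) -> None:
        """Add delta to entry i of sequence j."""
        self.rows[j - 1][i - 1] += delta

    def insert(self, i: int) -> None:
        """Insert a 0 at position i of every sequence."""
        for r in self.rows:
            r.insert(i - 1, 0)

    def delete(self, i: int) -> None:
        """Delete position i from every sequence; only 0's may be deleted."""
        if any(r[i - 1] != 0 for r in self.rows):
            raise ValueError("only 0's can be deleted")
        for r in self.rows:
            del r[i - 1]

c = NaiveCSPSI([
    [8, 2, 9, 5, 11, 0, 5, 12, 0, 3, 1],
    [9, 0, 7, 3, 6, 0, 19, 0, 4, 2, 8],
    [1, 5, 3, 12, 4, 0, 3, 5, 4, 1, 0],
])
print(c.sum(2, 5))   # 25
c.delete(6)          # removes the all-zero column; each sequence now has 10 entries
c.insert(6)          # puts a 0 back at position 6 of every sequence
```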
Our results on CSPSI

- Assumptions
  - d = O(lg^η n)
  - |δ| ≤ lg n
- Space
  - O(kdn + w) bits, where w is the word size
  - Buffer: O(n lg n) bits
- Time
  - All operations: O(lg n / lg lg n)
Data Structures for Dynamic Strings Over a Small Alphabet of Size O(lg^(1/2) n)

- Main data structure: a B-tree constructed over S
- Leaf
  - Each leaf stores a superblock of at most 2L bits, which encodes a substring of S (L = lg^2 n / lg lg n)
  - The numbers of occurrences of each character in all the superblocks form an integer sequence
  - Maintain the above sequences for all the characters in the alphabet in a CSPSI structure E
- Internal node v (lg^(1/2) n ≤ degree(v) ≤ 2 lg^(1/2) n)
  - U(v): U(v)[i] = number of leaves of the subtree rooted at the i-th child of v
  - I(v): I(v)[i] = number of characters stored in the subtree rooted at the i-th child of v
Supporting Queries

- rank(α, i)
  - Perform a top-down traversal with the help of the I(v)'s
  - Locate the superblock, j, containing S[i] with the help of the U(v)'s
  - Perform sum(α, j-1) on E to count the number of occurrences of α in superblocks 1, 2, …, j-1
  - Read superblock j in blocks of (lg n) / 2 bits
- The support for access and select is similar
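
The shape of the rank query can be sketched without the B-tree. In the simplified model below, a flat list of fixed-length superblocks replaces the tree, a linear scan replaces the top-down traversal, and a list of per-superblock counters stands in for the CSPSI structure E. The function names `build` and `rank` and the block length of 4 are assumptions for illustration, and none of the stated time or space bounds apply to this code.

```python
from collections import Counter

def build(s: str, block_len: int):
    """Split s into superblocks and count each character per superblock."""
    blocks = [s[k:k + block_len] for k in range(0, len(s), block_len)]
    counts = [Counter(b) for b in blocks]   # stand-in for E
    return blocks, counts

def rank(blocks, counts, a: str, i: int) -> int:
    """Occurrences of a in S[1..i], for 1 <= i <= n (1-based)."""
    # Locate the superblock containing S[i]; j is 0-based in this sketch.
    j, before = 0, 0
    while before + len(blocks[j]) < i:
        before += len(blocks[j])
        j += 1
    # Occurrences of a in all superblocks that precede the located one.
    total = sum(c[a] for c in counts[:j])
    # Scan only the needed prefix of the located superblock.
    return total + blocks[j][: i - before].count(a)

S = "aabacccdaddabbbc"
blocks, counts = build(S, 4)
print(rank(blocks, counts, "a", 8))   # 3
```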
Insert, delete and deamortization

- Supporting insert and delete requires traversing and updating the B-tree and updating E
- It is, however, much more complicated
  - Merging and splitting B-tree nodes
  - Deamortization
Succinct Global Rebuilding


- A key technique for deamortizing operations on B-trees is global rebuilding (Overmars and van Leeuwen 1981)
- Global rebuilding
  - Rebuild the B-tree after the number of update operations performed exceeds half the initial length of the string
  - A new copy and an old copy of the B-tree: more space
  - A buffer of O(n lg n) bits is required
- Succinct global rebuilding
  - Only one copy of the data: no duplication
  - During rebuilding, queries and updates are performed on either the new part or the old part
  - No buffer required
Putting Everything Together

- Dynamic strings over an alphabet of size O(lg^(1/2) n)
  - Space: n H0 + o(n)∙lg σ bits
  - Time: O(lg n / lg lg n)
- This can be extended to general alphabets using wavelet trees
  - Space: n H0 + o(n)∙lg σ bits
  - Time: O((lg n / lg lg n)(lg σ / lg lg n + 1))
- When σ = polylog(n) or 2 (bit vectors)
  - Space: n H0 + o(n)∙lg σ bits
  - Time: O(lg n / lg lg n)
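
How a wavelet tree reduces a query over a general alphabet to queries over smaller alphabets can be seen in a minimal static sketch. The version below is a plain binary wavelet tree with uncompressed Python lists and naive bit counting, so it illustrates the reduction only; it is not dynamic, and the lg σ / lg lg n factor in the bound above suggests a higher-degree tree than the binary one shown. The class name `WaveletTree` is hypothetical.

```python
class WaveletTree:
    """Static binary wavelet tree over a sorted alphabet; supports rank only."""

    def __init__(self, s, alphabet=None):
        self.alphabet = sorted(set(s)) if alphabet is None else alphabet
        self.bits = None                      # None marks a leaf
        if len(self.alphabet) <= 1:
            return
        mid = len(self.alphabet) // 2
        self.low = set(self.alphabet[:mid])   # characters routed to the left child
        # Bit 0: character is in the lower half of the alphabet; bit 1: upper half.
        self.bits = [0 if c in self.low else 1 for c in s]
        self.left = WaveletTree([c for c in s if c in self.low], self.alphabet[:mid])
        self.right = WaveletTree([c for c in s if c not in self.low], self.alphabet[mid:])

    def rank(self, a, i: int) -> int:
        """Occurrences of a among the first i characters."""
        if self.bits is None:
            return i if self.alphabet == [a] else 0
        ones = sum(self.bits[:i])             # naive rank on the node's bit vector
        if a in self.low:
            return self.left.rank(a, i - ones)
        return self.right.rank(a, ones)

S = "aabacccdaddabbbc"
wt = WaveletTree(S)
print(wt.rank("a", 8))   # 3
```

Each level of the tree answers one rank query on a bit vector, so the number of levels determines how many small-alphabet queries a general query costs.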
Applications

- Dynamic text collections
  - Data: a collection of text strings
  - Operations
    - Pattern search
    - Display a substring
    - Insert/delete a text string
- Compressed construction of full-text indexes
  - Working space: n Hk + o(n)∙lg σ bits
  - Time: O((n lg n / lg lg n)(lg σ / lg lg n + 1))
Conclusions
- We designed a succinct representation of dynamic strings that provides more efficient operations than previous results
- This structure can be directly applied to improve previous results on text indexing
- We expect our results to play an important role in the design of dynamic succinct data structures
- We expect succinct global rebuilding to be useful for the deamortization of algorithms on dynamic succinct data structures

Thank you!