Succinct Data Structures
Kunihiko Sadakane
National Institute of Informatics
Dynamic Data Structures
• Bit vectors, strings, ordered trees
• Operations: access, rank, select, insert, delete
Memory Model
• Memory consists of a bit array (vector) M[0..]
• Consecutive w bits can be read/written in O(1) time
– w: word length of CPU
• Memory consumption of an algorithm is defined as
the maximum memory address accessed by the
algorithm
Dynamic Memory Management [1]
• Consider a data structure for this problem
• B: an array of m variable-length bit strings
– B[i] is called block i
– Each block is of length at most b bits
• address(i): returns the address of block i
• realloc(i, b’): changes the length of block i to b’
(address(i) may change)
• Computation model: word RAM with word length w
Theorem: Assume b ≤ m and log m ≤ w.
Let s be the sum of the lengths of blocks B[1..m].
Then B can be stored in s + O(m log m + b^2) bits, with
address taking O(1) time and realloc taking O(b/w)
time.
Proof: Let p = Θ(log(mb)); p is the number of bits needed to represent a
pointer into B.
Divide the memory into segments of b+4p bits.
The unit of memory allocation and deallocation is a
segment. (A middle segment is never deallocated;
allocation and deallocation always happen at the last segment.)
Store segments in doubly-linked lists.
List L_x stores all blocks of length x (1 ≤ x ≤ b).
– pred, succ (p bits each): addresses of the preceding/succeeding
segments
– offset (log b ≤ p bits): address of the first block in the
segment
– block_data (b+p bits): space to store blocks
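As a quick check, the four fields listed above add up to the segment size used in the proof. The numeric values in the second line are purely illustrative assumptions (they do not come from the slides), and p is only fixed up to a constant factor.

```latex
\begin{align*}
\underbrace{p}_{\text{pred}} + \underbrace{p}_{\text{succ}} + \underbrace{p}_{\text{offset}}
  + \underbrace{(b + p)}_{\text{block\_data}} &= b + 4p \\
\text{e.g. } m = 2^{20},\ b = 2^{10}: \quad \log_2(mb) = 30
  &\;\Rightarrow\; p \text{ is on the order of } 30 \text{ bits}
\end{align*}
```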
[Figure: list L_x as a doubly-linked chain of segments. Each segment holds pred (p bits), offset (p bits), block_data (b+p bits, containing the blocks) and succ (p bits).]
• A block stores the bit string of B[i] together with its index i
(x + log m ≤ x + p bits)
• Blocks are stored in a segment from right to left. If a block
does not fit in a segment, a new segment is
allocated; the block is divided into two pieces, and each
piece is stored in one of the two segments.
• A block can thus be stored in one or two segments:
(block length) ≤ b + p = (block_data length)
• To enumerate all blocks in a segment, traverse
block_data from position offset.
• The blocks B[i] can be stored in a list in arbitrary order.
Another array is used to store pointers to blocks.
• For each block B[i] (1 ≤ i ≤ m),
– Len[i] (log b ≤ p bits): length of B[i]
– Pos[i] (log (b+p) ≤ p bits): position of block B[i] inside a
segment
– Seg[i], Ind[j] (p bits each): segment storing B[i]
[Figure: example with blocks 1, 3, 5, 10, 25, 100. Seg[1] = Seg[5] = Seg[100] = 1 and Seg[3] = Seg[10] = Seg[25] = 2; Ind points to each segment, and Pos[i] marks where B[i] lies inside its segment.]
• For all blocks stored in the same segment, values of
Seg are identical (Seg[i1] = Seg[i2] = … = j).
• Ind[j] represents the actual address of the segment.
Implementing address(i)
• Because a block is stored in one or two segments,
let the return value of address(i) be two pairs of
(addr, len).
• The first pair is determined by Ind[Seg[i]], Pos[i],
Len[i].
• If the block does not fit in the segment, the second pair
can be found via succ of the segment.
– the rest of the block is stored as the first block in the
segment pointed to by succ of the segment.
• O(1) time
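The lookup described above can be sketched in a few lines. This is a minimal illustration, not the slides' implementation: it assumes a hypothetical segment layout of pred, offset, block_data, succ in that order, keeps `succ` and `offset` in Python dictionaries keyed by segment address instead of reading them from the bit array, and ignores both the index stored with each block and the right-to-left storage direction.

```python
def address(i, Len, Pos, Seg, Ind, succ, offset, p, b):
    """Return one or two (addr, len) pairs covering block i.

    Assumed (hypothetical) layout of a segment of b + 4p bits:
    [pred: p][offset: p][block_data: b + p][succ: p]
    """
    data_len = b + p                    # size of block_data in one segment
    seg_addr = Ind[Seg[i]]              # actual address of the segment
    data_addr = seg_addr + 2 * p        # block_data follows pred and offset
    first_len = min(Len[i], data_len - Pos[i])
    pieces = [(data_addr + Pos[i], first_len)]
    rest = Len[i] - first_len
    if rest > 0:
        # The remainder is the first block of the succeeding segment.
        nxt = succ[seg_addr]
        pieces.append((nxt + 2 * p + offset[nxt], rest))
    return pieces


# Toy instance: p = 4, b = 8, so a segment has 24 bits and block_data has 12.
p, b = 4, 8
Ind = {1: 0, 2: 24}
Seg = {7: 1, 9: 1}
Pos = {7: 0, 9: 8}
Len = {7: 8, 9: 7}
succ = {0: 24}
offset = {0: 0, 24: 0}
print(address(7, Len, Pos, Seg, Ind, succ, offset, p, b))
print(address(9, Len, Pos, Seg, Ind, succ, offset, p, b))
```

Every step is a constant number of table lookups, which is where the O(1) bound comes from.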
Implementing realloc(i, b’)
• Find the current address and length b of block i
• Copy the content of block i to temporary space.
• Move the front block j in Lb to the emptied space.
– update Pos[j] and Seg[j]
• If the head segment of Lb becomes empty, delete it
– move the last segment z in the memory to the emptied
space.
– update Ind[z]
• Insert block i at the head of Lb’. If the head segment
does not have enough space, allocate a new segment.
(memory region used is extended.)
• Movement of blocks and segments takes O(b/w) time
• Update of pointers takes O(1) time
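The key trick in realloc is that removing a block leaves a hole, which is filled by moving one other block of the same length and updating only that block's pointer. The toy model below illustrates just this idea. It is a simplification under stated assumptions: one Python list per length stands in for list L_x, segments and bit-level addresses are not modelled, the class and method names are hypothetical, and padding with zeros or truncating on resize is an arbitrary choice made here.

```python
class ToyBlockStore:
    """Toy model of the lists L_x: blocks grouped by their current length."""

    def __init__(self):
        self.lists = {}   # length x -> list of (block id, bit string)
        self.where = {}   # block id -> (x, index inside lists[x])

    def insert(self, i, bits):
        x = len(bits)
        lst = self.lists.setdefault(x, [])
        self.where[i] = (x, len(lst))
        lst.append((i, bits))

    def _remove(self, i):
        x, k = self.where.pop(i)
        lst = self.lists[x]
        bits = lst[k][1]               # copy the content to temporary space
        j, jbits = lst.pop()           # block at the open end of L_x
        if j != i:
            lst[k] = (j, jbits)        # move it into the hole left by block i
            self.where[j] = (x, k)     # only block j's pointer changes
        if not lst:
            del self.lists[x]          # the list became empty
        return bits

    def realloc(self, i, new_len):
        bits = self._remove(i)
        bits = bits[:new_len].ljust(new_len, "0")   # assumed resize policy
        self.insert(i, bits)

    def get(self, i):
        x, k = self.where[i]
        return self.lists[x][k][1]


store = ToyBlockStore()
store.insert(1, "101")
store.insert(2, "111")
store.insert(3, "000")
store.realloc(1, 5)
print(store.get(1), store.get(2), store.get(3))
```

Because only one other block moves per removal, the work is proportional to the block length plus a constant number of pointer updates.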
Required space
• The sum of block lengths is s
• At most one segment has empty space for each
list Lx
– The number of segments to store all the blocks is
at most s/b + b
• Required space is s + O(b^2 + m log m) bits
• This data structure is denoted by D(b, m)
• Note: If the length of a block is w, the redundancy of
log m bits for a block is too large.
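One way to see where the bound comes from is to add up the pieces. The accounting below is a sketch reconstructed from the field sizes given earlier (segments of b + 4p bits, p = Θ(log(mb)), and the arrays Len, Pos, Seg, Ind); the grouping into terms is an assumption and is not taken from the slides.

```latex
\begin{align*}
\text{block contents (bit strings plus indices)} &\le s + m \log m \\
\text{number of segments } N &\le \frac{s + m \log m}{b + p} + b \le m + b
  && (s \le mb,\ \log m \le p) \\
\text{pred, succ, offset over all segments} &= 3pN = O((m + b)\,p) \\
\text{unused block\_data (at most one partly filled segment per } L_x) &\le b\,(b + p) \\
\text{arrays Len, Pos, Seg, Ind} &= O((m + b)\,p) \\
\text{total} &= s + O(mp + bp + b^2) = s + O(m \log m + b^2)
  && (b \le m,\ p = \Theta(\log m))
\end{align*}
```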
Theorem: Assume b = O(w) and log m ≤ w.
Let s be the sum of the lengths of all blocks B[1..m].
Then B can be stored in s + O(m log w + w^4) bits, and
address and realloc are done in O(1) time.
Proof: Divide B into pieces of w^3 elements, and store
each piece using D(1+w, w^3). Let D_i denote the i-th such structure.
Segments used by the D_i’s are managed by using another
data structure D.
• D_i stores w^3 blocks. Thus it stores w^4 bits.
• Divide it into w^2 pages.
– Each page has w or zero segments.
• The number of pages used by all D_i is m/w.
• Store these pages in data structure D(O(w^2), m/w).
• address takes O(1) time
• realloc in D takes O(w) time, but it occurs only after w
occurrences of realloc in D_i, so the time complexity can
be improved to O(1).
Required space
• Each D_i uses (block size) + O(w^3 log w) bits
• These sum to s + O(m log w) bits
• D uses O(w^4 + (m/w) log(m/w)) bits
• In total, s + O(w^4 + m log w) bits
If a 0,1 vector of length n is stored by this data structure
• s = n, w = log n
• b = log n (length of a block)
• m = s / b (number of blocks)
• Space is nH_0 + O(n log log n / log n + log^4 n)
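The redundancy term follows by substituting the parameters listed above into O(m log w + w^4):

```latex
\begin{align*}
m &= \frac{s}{b} = \frac{n}{\log n}, \\
m \log w &= \frac{n}{\log n} \cdot \log\log n = \frac{n \log\log n}{\log n}, \\
w^4 &= \log^4 n, \\
O(m \log w + w^4) &= O\!\left(\frac{n \log\log n}{\log n} + \log^4 n\right).
\end{align*}
```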
Dynamic Bit Vectors [2]
• Store a bit-vector B[1..n] of length n
• Operations
– access, rank, select
– insert(i, c): inserts a bit c between B[i] and B[i+1]
– delete(i): deletes B[i]
A Simple Data Structure
• Divide the vector into blocks of length between
L and 2L (L = Θ(log^2 n))
• Store blocks in a balanced binary search tree in the
order of positions in the vector.
– Blocks are stored in leaves. O(n / log^2 n) blocks.
– An internal node stores the number of 1’s in blocks
stored in the subtree rooted at the node.
– Space to store the tree: O(n/log n) bits
• If, by insert, the length of a block exceeds 2L,
partition the block into two.
• If, by delete, the length of a block becomes
less than L,
– if one of its adjacent blocks has length more than L,
move 1 bit from that block.
– if both of its adjacent blocks have length L, merge the
block with one of the adjacent ones.
• Operation on balanced binary search tree takes
O(log n) time
• Operations to blocks are also in O(log n) time
• Note: If the value of n doubles, the value of “log n” will
change
– If log n changes, all indexes must be reconstructed.
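The split, borrow and merge rules above can be tried out in a small sketch. To keep it short, the version below replaces the balanced binary search tree with a flat Python list of blocks that is scanned linearly, so it shows the block maintenance rules but not the O(log n) time bound. The class name, the 0-based positions, and the convention that `insert(i, c)` makes c the new bit at position i are assumptions made for this example; the initial chunking may also leave a final block shorter than L.

```python
class BlockedBitVector:
    """Bits kept in blocks of length about L..2L (flat list instead of a tree)."""

    def __init__(self, bits, L=4):
        self.L = L
        self.blocks = [list(bits[k:k + L]) for k in range(0, len(bits), L)] or [[]]

    def _locate(self, i):
        """Return (block index, offset) for position i by a linear scan."""
        for k, blk in enumerate(self.blocks):
            if i < len(blk):
                return k, i
            i -= len(blk)
        return len(self.blocks) - 1, len(self.blocks[-1])   # one past the end

    def bits(self):
        return [x for blk in self.blocks for x in blk]

    def access(self, i):
        k, off = self._locate(i)
        return self.blocks[k][off]

    def rank1(self, i):
        """Number of 1s among positions 0 .. i-1."""
        count = 0
        for blk in self.blocks:
            if i <= len(blk):
                return count + sum(blk[:i])
            count += sum(blk)          # a tree node would store this count
            i -= len(blk)
        return count

    def insert(self, i, c):
        k, off = self._locate(i)
        blk = self.blocks[k]
        blk.insert(off, c)
        if len(blk) > 2 * self.L:      # overflow: partition the block into two
            half = len(blk) // 2
            self.blocks[k:k + 1] = [blk[:half], blk[half:]]

    def delete(self, i):
        k, off = self._locate(i)
        blk = self.blocks[k]
        del blk[off]
        if len(blk) >= self.L or len(self.blocks) == 1:
            return
        left = self.blocks[k - 1] if k > 0 else None
        right = self.blocks[k + 1] if k + 1 < len(self.blocks) else None
        if right is not None and len(right) > self.L:
            blk.append(right.pop(0))                 # borrow one bit from the right
        elif left is not None and len(left) > self.L:
            blk.insert(0, left.pop())                # borrow one bit from the left
        elif right is not None:
            self.blocks[k:k + 2] = [blk + right]     # merge with the right neighbour
        else:
            self.blocks[k - 1:k + 1] = [left + blk]  # merge with the left neighbour


bv = BlockedBitVector([1, 0, 1, 1, 0, 0, 1, 0], L=2)
bv.insert(3, 1)
bv.delete(0)
print(bv.bits(), bv.rank1(4))
```

Replacing the flat list with a balanced tree whose internal nodes store subtree lengths and 1-counts turns each linear scan into a root-to-leaf walk.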
Change of “log n”
• Partition the vector into three parts
– left: use word length w = log n − 1
– middle: use word length w = log n
– right: use word length w = log n + 1
• If insert is done: move the rightmost bit of left to
middle, the rightmost bit of middle to right
• If delete is done: move the leftmost bit of right to
middle, the leftmost bit of middle to left
• If the value of n is doubled, left becomes empty and
middle, right become new left, middle.
• Reconstruction of indexes is unnecessary.
Theorem: A bit-vector of length n is stored in
nH_0 + O(n / log n) bits, and access, rank, select, insert,
delete are done in O(log n) time.
Dynamic Ordered Trees [2]
• Tree structure is represented by a balanced parentheses (BP)
sequence, and stored like the dynamic bit vector.
• Each node of the balanced binary search tree stores
not only the number of 1’s, but also values in nodes
of the range min-max tree.
Theorem: An n node dynamic ordered tree is
represented in 2n + O(n /log n) bits, and all
operations are done in O(log n) time.
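To illustrate why storing minimum excess values helps, the sketch below answers findclose on a BP string using per-block summaries. It is a static, one-level simplification: only the leaf-level summaries of a range min-max tree are kept (net excess and minimum prefix excess per block), and blocks are skipped by a linear scan rather than by walking a tree. The function names and the block size are assumptions for this example.

```python
def build_summaries(bp, block):
    """Per block: (net excess change, minimum prefix excess inside the block)."""
    out = []
    for start in range(0, len(bp), block):
        e, m = 0, float("inf")
        for ch in bp[start:start + block]:
            e += 1 if ch == "(" else -1
            m = min(m, e)
        out.append((e, m))
    return out


def findclose(bp, summaries, block, i):
    """Position of the ')' matching the '(' at position i."""
    assert bp[i] == "("
    k = i // block
    cur = 0                                      # excess relative to position i
    for j in range(i, min(len(bp), (k + 1) * block)):
        cur += 1 if bp[j] == "(" else -1         # scan the rest of i's block
        if cur == 0:
            return j
    for kb in range(k + 1, len(summaries)):
        e, m = summaries[kb]
        if cur + m > 0:                          # excess never reaches 0 here
            cur += e                             # skip the whole block
            continue
        for j in range(kb * block, min(len(bp), (kb + 1) * block)):
            cur += 1 if bp[j] == "(" else -1
            if cur == 0:
                return j
    raise ValueError("unbalanced sequence")


bp = "(()(()()))()"
block = 3
summ = build_summaries(bp, block)
print(findclose(bp, summ, block, 0), findclose(bp, summ, block, 3))
```

In the dynamic structure these summaries live in the nodes of the balanced search tree, so the skipping step becomes a tree walk instead of a scan over blocks.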
Faster Data Structures
• As the range min-max tree, use a B-tree with
branching factor between log n and 2 log n
– The depth of the range min-max tree becomes
O(log n / log log n)
• L = log^2 n / log log n
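The depth bound follows from the branching factor: a B-tree whose nodes have at least log n children and which has n/L leaves has depth

```latex
O\!\left(\log_{\log n} \frac{n}{L}\right)
  = O\!\left(\frac{\log (n/L)}{\log\log n}\right)
  = O\!\left(\frac{\log n}{\log\log n}\right).
```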
Time for the Operations
• P[i], findclose, findopen, enclose, rmq, pre_rank,
pre_select, isleaf, isancestor, depth, parent,
first_child, last_child, next_sibling, prev_sibling,
subtree_size, lca, deepest_node, height, in_rank,
in_select, leaf_rank, leaf_select, leftmost_leaf,
rightmost_leaf: O(log n / log log n) time
• level_ancestor, level_next, level_leftmost,
level_prev, level_rightmost: O(log n) time
• insert, delete: O(log n/ log log n) or O(log n) time
• degree, child, child_rank: O(q log n / log log n)
or O(log n) time (q = degree)
References
[1] Jesper Jansson, Kunihiko Sadakane, Wing-Kin Sung: Compressed
Random Access Memory. arXiv:1011.1708v1.
[2] Kunihiko Sadakane, Gonzalo Navarro: Fully-Functional Succinct
Trees. SODA 2010: 134-149.
[3] Jacob Ziv, Abraham Lempel: Compression of Individual
Sequences via Variable-Rate Coding. IEEE Transactions on
Information Theory, September 1978.
[4] S. Rao Kosaraju, Giovanni Manzini: Compression of Low Entropy
Strings with Lempel-Ziv Algorithms. SIAM J. Comput. 29(3): 893-911 (1999).
[5] Rodrigo González, Gonzalo Navarro: Statistical Encoding of
Succinct Data Structures. Proc. CPM'06, pages 295-306. LNCS 4009.
[6] P. Ferragina, R. Venturini: A Simple Storage Scheme for Strings
Achieving Entropy Bounds. Theoretical Computer Science 372(1): 115-121, 2007.