Succinct Data Structures
Kunihiko Sadakane
National Institute of Informatics

Dynamic Data Structures
• Bit vectors, strings, ordered trees
• Operations: access, rank, select, insert, delete

Memory Model
• Memory consists of a bit array (vector) M[0..]
• Any w consecutive bits can be read or written in O(1) time
  – w: word length of the CPU
• The memory consumption of an algorithm is defined as the maximum memory address accessed by the algorithm

Dynamic Memory Management [1]
• Consider a data structure for the following problem
• B: an array of m variable-length bit strings
  – B[i] is called block i
  – Each block is at most b bits long
• address(i): returns the address of block i
• realloc(i, b’): changes the length of block i to b’ (address(i) may change)
• Computation model: word RAM with word length w

Theorem: Assume b ≤ m and log m ≤ w. Let s be the sum of the lengths of blocks B[1..m]. Then B can be stored in s + O(m log m + b²) bits, address takes O(1) time, and realloc takes O(b/w) time.

Proof: Let p = Θ(log(mb)); p is the number of bits needed to represent a pointer to B. Divide the memory into segments of b + 4p bits. The unit of memory allocation and deallocation is a segment. (A middle segment is never deallocated. Allocation and deallocation always happen at the last segment.)

Store segments in doubly-linked lists. List Lx stores all blocks of length x (1 ≤ x ≤ b).
  – pred, succ (p bits each): addresses of the preceding and succeeding segments
  – offset (log b ≤ p bits): address of the first block in the segment
  – block_data (b + p bits): space to store blocks
[Figure: segments of list Lx, each laid out as pred (p bits), offset (p bits), block_data (b + p bits), succ (p bits); a block may straddle two consecutive segments.]

• Each stored block holds the bit string of B[i] together with i (x + log m ≤ x + p bits)
• Blocks are stored in a segment from right to left. If a block does not fit in the segment, a new segment is allocated, the block is divided into two pieces, and each piece is stored in one of the two segments.
• A block can be stored in one or two segments.
  – (block length) ≤ b + p = (block_data length)
• To enumerate all blocks in a segment, traverse block_data starting from position offset.
• The blocks B[i] can be stored in a list in arbitrary order. Another array is used to store pointers to the blocks.

• For each block B[i] (1 ≤ i ≤ m):
  – Len[i] (log b ≤ p bits): length of B[i]
  – Pos[i] (log(b + p) ≤ p bits): position of block B[i] inside its segment
  – Seg[i], Ind[j] (p bits): segment storing B[i]
• For all blocks stored in the same segment, the values of Seg are identical (Seg[i1] = Seg[i2] = … = j).
• Ind[j] represents the actual address of the segment.
[Figure: example with Seg[1] = Seg[5] = Seg[100] = 1 and Seg[3] = Seg[10] = Seg[25] = 2; Ind[j] points to segment j, and Pos[i] marks where B[i] starts inside its segment.]

Implementing address(i)
• Because a block is stored in one or two segments, let the return value of address(i) be two pairs of (addr, len).
• The first pair is determined by Ind[Seg[i]], Pos[i], Len[i].
• If the block does not fit in the segment, the second pair is found by following succ of the segment.
  – The rest of the block is stored in the first block of the segment pointed to by succ.
• O(1) time

Implementing realloc(i, b’)
• Find the current address and length b of block i.
• Copy the content of block i to temporary space.
• Move the front block j of Lb to the emptied space.
  – Update Pos[j] and Seg[j].
• If the head segment of Lb becomes empty, delete it.
  – Move the last segment z in memory to the emptied space.
  – Update Ind[z].
• Insert block i at the head of Lb’. If the head segment does not have enough space, allocate a new segment (the memory region in use is extended).
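The lookup performed by address(i) can be sketched in Python as follows. This is only an illustration under simplifying assumptions that are not in the slides: addresses grow left to right (the slides store blocks right to left), Ind[j] is taken to be the address of the block_data area of segment j, Succ[j] holds a segment id rather than a raw address, and the index i stored alongside each block is ignored. The names Succ, Offset, and data_len are hypothetical.

```python
def address(i, Seg, Ind, Pos, Len, Succ, Offset, data_len):
    """Return one or two (addr, len) pairs locating block i.

    Assumed simplified layout (not taken from the slides): addresses grow
    left to right, Ind[j] is the address of segment j's block_data area,
    Succ[j] is the id of the succeeding segment, and Offset[j] is the
    position of the first block inside segment j.
    """
    j = Seg[i]
    room = data_len - Pos[i]            # bits left in segment j from Pos[i]
    first_len = min(Len[i], room)
    pairs = [(Ind[j] + Pos[i], first_len)]
    rest = Len[i] - first_len
    if rest > 0:                        # block spills into the next segment
        k = Succ[j]
        pairs.append((Ind[k] + Offset[k], rest))
    return pairs


# Hypothetical example: two segments whose block_data areas hold 16 bits each.
Ind = {1: 100, 2: 200}
Succ = {1: 2, 2: None}
Offset = {1: 0, 2: 0}
Seg = {7: 1, 8: 1}
Pos = {7: 0, 8: 10}
Len = {7: 10, 8: 10}

print(address(7, Seg, Ind, Pos, Len, Succ, Offset, 16))  # [(100, 10)]
print(address(8, Seg, Ind, Pos, Len, Succ, Offset, 16))  # [(110, 6), (200, 4)]
```

Block 7 fits entirely in segment 1, so a single pair is returned. Block 8 starts 10 bits into a 16-bit area, so 6 bits stay in segment 1 and the remaining 4 bits are found through succ. Every step is a constant number of array lookups, matching the O(1) bound stated above.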
• In realloc, moving blocks and segments takes O(b/w) time
• Updating pointers takes O(1) time

Required space
• The sum of the block lengths is s
• For each list Lx, at most one segment has empty space
  – The number of segments needed to store all the blocks is at most s/b + b
• Required space is s + O(b² + m log m) bits
• This data structure is denoted by D(b, m)
• Note: If the length of a block is w, the redundancy of log m bits per block is too large.

Theorem: Assume b = O(w) and log m ≤ w. Let s be the sum of the lengths of all blocks B[1..m]. Then B can be stored in s + O(m log w + w⁴) bits, and address and realloc take O(1) time.

Proof: Divide B into pieces of w³ elements, and store each piece using D(1+w, w³). Let Di denote it. The segments used by the Di’s are managed by another data structure D.
• Each Di stores w³ blocks, and thus w⁴ bits.
• Divide it into w² pages.
  – Each page has w or zero segments.
• The number of pages used by all Di is m/w.
• Store these pages in the data structure D(O(w²), m/w).
• address takes O(1) time.
• realloc in D takes O(w) time, but it occurs only after w occurrences of realloc in Di, so the time complexity can be improved to O(1).

Required space
• Each Di uses (block size) + O(w³ log w) bits
• These sum to s + O(m log w) bits
• D uses O(w⁴ + (m/w) log(m/w)) bits
• In total, s + O(w⁴ + m log w) bits

If a 0,1 vector of length n is stored in this data structure:
• s = n, w = log n
• b = log n (length of a block)
• m = s / b (number of blocks)
• Space is nH₀ + O(n log log n / log n + log⁴ n)

Dynamic Bit Vectors [2]
• Store a bit vector B[1..n] of length n
• Operations
  – access, rank, select
  – insert(i, c): inserts a bit c between B[i] and B[i+1]
  – delete(i): deletes B[i]

A Simple Data Structure
• Divide the vector into blocks of length between L and 2L (L = Θ(log² n))
• Store the blocks in a balanced binary search tree, ordered by position in the vector.
  – Blocks are stored in leaves. There are O(n / log² n) blocks.
  – An internal node stores the number of 1’s in the blocks stored in the subtree rooted at that node.
  – Space to store the tree: O(n / log n) bits
• If an insert makes the length of a block exceed 2L, split the block into two.
• If a delete makes the length of a block less than L:
  – if one of its adjacent blocks is longer than L, move 1 bit from that block;
  – if both of its adjacent blocks are of length L, merge the block with one of them.
• An operation on the balanced binary search tree takes O(log n) time
• Operations on blocks also take O(log n) time
• Note: If the value of n doubles, “log n” changes
  – If log n changes, all indexes must be reconstructed.

Change of “log n”
• Partition the vector into three parts
  – left: uses word length w = log n − 1
  – middle: uses word length w = log n
  – right: uses word length w = log n + 1
• On insert: move the rightmost bit of left to middle, and the rightmost bit of middle to right
• On delete: move the leftmost bit of right to middle, and the leftmost bit of middle to left
• If the value of n doubles, left becomes empty, and middle and right become the new left and middle.
• Reconstruction of the indexes is unnecessary.

Theorem: A bit vector of length n can be stored in nH₀ + O(n / log n) bits, and access, rank, select, insert, delete take O(log n) time.

Dynamic Ordered Trees [2]
• The tree structure is represented by BP and stored like the dynamic bit vector.
• Each node of the balanced binary search tree stores not only the number of 1’s, but also the values kept in the nodes of the range min-max tree.

Theorem: An n-node dynamic ordered tree can be represented in 2n + O(n / log n) bits, and all operations take O(log n) time.
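The block splitting, borrowing, and merging rules of the simple dynamic bit vector can be illustrated with a small Python sketch. It is not the structure from the slides: the blocks are kept in a flat Python list instead of a balanced binary search tree, so every operation here takes time linear in the number of blocks rather than O(log n). The class name BlockedBitVector, the 0-based positions, and the small fixed parameter L are assumptions made for the illustration.

```python
class BlockedBitVector:
    """Bit vector kept as blocks of L..2L bits (a single block may be shorter).

    Assumption: a flat list of blocks replaces the balanced search tree,
    so locating a block is a linear scan rather than an O(log n) descent.
    """

    def __init__(self, bits, L=4):
        self.L = L
        bits = list(bits)
        self.blocks = [bits[k:k + L] for k in range(0, len(bits), L)] or [[]]
        # Fold a short trailing block into its predecessor (result is < 2L bits).
        if len(self.blocks) > 1 and len(self.blocks[-1]) < L:
            tail = self.blocks.pop()
            self.blocks[-1].extend(tail)

    def to_list(self):
        return [bit for blk in self.blocks for bit in blk]

    def _locate(self, i):
        """Return (block index, offset inside block) for 0-based position i."""
        for k, blk in enumerate(self.blocks):
            if i < len(blk):
                return k, i
            i -= len(blk)
        raise IndexError("position out of range")

    def access(self, i):
        k, off = self._locate(i)
        return self.blocks[k][off]

    def rank1(self, i):
        """Number of 1 bits among positions 0..i-1."""
        count = 0
        for blk in self.blocks:
            if i <= len(blk):
                return count + sum(blk[:i])
            count += sum(blk)
            i -= len(blk)
        raise IndexError("position out of range")

    def insert(self, i, c):
        """Insert bit c so that it ends up at 0-based position i."""
        k = 0
        while k < len(self.blocks) - 1 and i > len(self.blocks[k]):
            i -= len(self.blocks[k])
            k += 1
        if i > len(self.blocks[k]):
            raise IndexError("position out of range")
        blk = self.blocks[k]
        blk.insert(i, c)
        if len(blk) > 2 * self.L:                 # overflow: split into two blocks
            half = len(blk) // 2
            self.blocks[k:k + 1] = [blk[:half], blk[half:]]

    def delete(self, i):
        """Delete the bit at 0-based position i."""
        k, off = self._locate(i)
        blk = self.blocks[k]
        del blk[off]
        if len(blk) >= self.L or len(self.blocks) == 1:
            return
        if k + 1 < len(self.blocks) and len(self.blocks[k + 1]) > self.L:
            blk.append(self.blocks[k + 1].pop(0))     # borrow 1 bit from the right
        elif k > 0 and len(self.blocks[k - 1]) > self.L:
            blk.insert(0, self.blocks[k - 1].pop())   # borrow 1 bit from the left
        else:                                         # neighbours are minimal: merge
            j = k + 1 if k + 1 < len(self.blocks) else k - 1
            lo, hi = min(j, k), max(j, k)
            self.blocks[lo:hi + 1] = [self.blocks[lo] + self.blocks[hi]]


bv = BlockedBitVector([1, 0, 1, 1, 0, 0, 1, 0, 1, 1], L=2)
bv.insert(3, 0)
bv.delete(0)
print(bv.to_list(), bv.rank1(4))
```

In the structure described in the slides, the linear scans in _locate and rank1 would be replaced by a descent in the balanced tree, using the counts of 1’s stored in the internal nodes.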
Faster Data Structures
• For the range min-max tree, use a B-tree with branching factor between log n and 2 log n
  – The depth of the range min-max tree becomes O(log n / log log n)
• L = log² n / log log n

Time for the Operations
• P[i], findclose, findopen, enclose, rmq, pre_rank, pre_select, isleaf, isancestor, depth, parent, first_child, last_child, next_sibling, prev_sibling, subtree_size, lca, deepest_node, height, in_rank, in_select, leaf_rank, leaf_select, leftmost_leaf, rightmost_leaf: O(log n / log log n) time
• level_ancestor, level_next, level_leftmost, level_prev, level_rightmost: O(log n) time
• insert, delete: O(log n / log log n) or O(log n) time
• degree, child, child_rank: O(q log n / log log n) or O(log n) time (q = degree)

References
[1] Jesper Jansson, Kunihiko Sadakane, Wing-Kin Sung. Compressed random access memory. arXiv:1011.1708v1.
[2] Kunihiko Sadakane, Gonzalo Navarro. Fully-Functional Succinct Trees. SODA 2010: 134-149.
[3] Jacob Ziv and Abraham Lempel. Compression of Individual Sequences Via Variable-Rate Coding. IEEE Transactions on Information Theory, September 1978.
[4] S. Rao Kosaraju, Giovanni Manzini. Compression of Low Entropy Strings with Lempel-Ziv Algorithms. SIAM J. Comput. 29(3): 893-911 (1999).
[5] Rodrigo González and Gonzalo Navarro. Statistical Encoding of Succinct Data Structures. Proc. CPM'06, pages 295-306. LNCS 4009.
[6] P. Ferragina and R. Venturini. A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science, 372(1):115-121, 2007.