Succinct Data Structures
Kunihiko Sadakane
National Institute of Informatics
Succinct Data Structures
• Succinct data structures = succinct representation of
data + succinct index
• Examples
– sets
– trees, graphs
– strings
– permutations, functions
Succinct Representation
• A representation of data whose size (roughly)
matches the information-theoretic lower bound
• If the input is taken from L distinct ones, its
information-theoretic lower bound is log L bits
(base of logarithm is 2)
• Example 1: a lower bound for a set S ⊆ {1,2,...,n}
– log 2^n = n bits
– e.g. n = 3: the 2^3 = 8 subsets are
{}, {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}
• Example 2: n-node ordered tree
– log [ (1/(2n+1)) C(2n+1, n+1) ] = 2n - Θ(log n) bits
(e.g. for n = 4 there are 5 ordered trees)
• Example 3: length-n string
– n log σ bits (σ: alphabet size)
• For any data, there exists its succinct representation
– enumerate all in some order, and assign codes
– can be represented in ⌈log L⌉ bits
– not clear if it supports efficient queries
• Query time depends on how data are represented
– necessary to use appropriate representations
(figure: the 5 ordered trees with n = 4 nodes,
assigned codes 000, 001, 010, 011, 100;
⌈log 5⌉ = 3 bits each)
Succinct Indexes
• Auxiliary data structure to support queries
• Size: o(log L) bits
• (Almost) the same time complexity as using
conventional data structures
• Computation model: word RAM
– assume word length w = Θ(log log L)
(same pointer size as conventional data structures)
word RAM
• A word RAM with word length w bits supports
– reading/writing w bits of memory at an arbitrary address
in constant time
– arithmetic/logical operations on two w-bit numbers,
done in constant time
– arithmetic ops.: +, -, *, /, log (most significant bit)
– logical ops.: and, or, not, shift
• These operations can be done in constant time using
O(2^(εw))-bit tables (ε > 0 is an arbitrary constant)
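A minimal sketch (not from the slides) of the table-lookup idea just mentioned: a word operation is answered in constant time by looking up precomputed answers for half-words. The operation, word size, and names below are my own choices for illustration.

```python
# Hedged sketch: popcount (number of 1 bits) on a w = 16-bit word
# via two lookups in a table of 2^(w/2) = 256 entries.
W = 16
HALF = W // 2
MASK = (1 << HALF) - 1
# table[p] = number of 1 bits in the 8-bit pattern p
TABLE = [bin(p).count("1") for p in range(1 << HALF)]

def popcount16(x):
    """Number of 1 bits in a 16-bit word: two table lookups, no bit scan."""
    return TABLE[x & MASK] + TABLE[x >> HALF]
```

The same trick with tables of size 2^(εw) underlies the constant-time in-block rank queries on the following slides.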
1. Bit Vectors
• B: a 0,1 vector of length n, B[0]B[1]…B[n-1]
• lower bound of size = log 2^n = n bits
• queries
– rank(B, x): number of ones in B[0..x] = B[0]B[1]…B[x]
– select(B, i): position of the i-th 1 from the head (i ≥ 1)
• basics of all succinct data structures
• naïve data structure
– store all the answers in arrays
– 2n words (2n log n bits)
– O(1) time queries
(example: B = 1001010001000000, n = 16;
the first ones are at positions 0, 3, 5, 9)
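A minimal sketch of this naïve structure, using the slide's example vector (array names are mine): every answer is precomputed, so both queries are O(1) at the cost of 2n words.

```python
# Naive rank/select: store all the answers in two arrays (2n words).
B = [1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0]  # the slide's example, n = 16

rank_answers = []            # rank_answers[x] = number of ones in B[0..x]
ones = 0
for bit in B:
    ones += bit
    rank_answers.append(ones)

select_answers = [pos for pos, bit in enumerate(B) if bit == 1]

def rank(x):                 # O(1) time
    return rank_answers[x]

def select(i):               # O(1) time, i >= 1
    return select_answers[i - 1]
```

For this vector the first four ones are at positions 0, 3, 5, 9, matching the figure.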
Succinct Index for Rank: 1
• Divide B into blocks of length log² n
• Store rank(x) at block boundaries in array R
rank(x) = Σ_{i=1}^{x} B[i]
        = R[⌊x/log² n⌋] + Σ_{i=⌊x/log² n⌋·log² n + 1}^{x} B[i]
• Size of R: (n/log² n) · log n = n/log n bits
• rank(x): O(log² n) time
Succinct Index for Rank: 2
• Divide each block into small blocks of length ½ log n
• Store rank(x) at small-block boundaries in R2,
relative to the enclosing block
rank(x) = R1[⌊x/log² n⌋] + R2[⌊x/(½ log n)⌋]
          + Σ_{i=⌊x/(½ log n)⌋·(½ log n) + 1}^{x} B[i]
• Size of R2: each entry is at most log² n, so
(n/(½ log n)) · log(log² n) = 4n log log n/log n bits
• rank(x): O(log n) time
Succinct Index for Rank: 3
• Compute answers to all possible queries on small
blocks in advance, and store them in a table
(indexed by a ½ log n-bit pattern and a position)
• Size of table:
2^{½ log n} · ½ log n · log(½ log n)
= O(√n log n log log n) = o(n log log n/log n) bits
pattern:  000 001 010 011 100 101 110 111
rank(0):   0   0   0   0   1   1   1   1
rank(1):   0   0   1   1   1   1   2   2
rank(2):   0   1   1   2   1   2   2   3
Theorem: rank on a bit-vector of length n is computed
in constant time on a word RAM with word length
Θ(log n) bits, using n + O(n log log n/log n) bits.
Note: The size of the table for computing rank inside a
small block is O(√n log n log log n) bits. This can be
ignored theoretically, but not in practice.
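A sketch of the whole three-part rank construction above (block sizes and names are my own; the in-block pattern extraction is written as a short Python loop where the real structure would use a single word operation, so only the structure, not the constant-time claim, is reproduced):

```python
import math

def build_rank(B):
    """Rank index: big blocks of length log^2 n (array R1), small
    blocks of length ~1/2 log n (array R2, relative to the big block),
    and one shared lookup table for in-block popcounts."""
    n = len(B)
    lg = max(2, int(math.log2(n)))
    big, small = lg * lg, lg // 2
    R1 = [0] * (n // big + 2)    # absolute rank at big-block boundaries
    R2 = [0] * (n // small + 2)  # rank relative to the enclosing big block
    ones = 0
    for i in range(n + 1):
        if i % big == 0:
            R1[i // big] = ones
        if i % small == 0:
            R2[i // small] = ones - R1[i // big]
        if i < n:
            ones += B[i]
    # one table shared by all small blocks: popcount of each pattern
    table = [bin(p).count("1") for p in range(1 << small)]

    def rank(x):  # number of ones in B[0..x]
        j = (x + 1) // small              # small blocks fully before x+1
        pat = 0
        for bit in B[j * small : x + 1]:  # < small bits; a word op in
            pat = (pat << 1) | bit        # the real structure
        return R1[(j * small) // big] + R2[j] + table[pat]

    return rank
```

The three addends correspond exactly to R1, R2, and the table lookup on the preceding slides.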
Succinct Index for Select
• More difficult than rank
– Let i = q·(log² n) + r·(½ log n) + s
– if i is a multiple of log² n, store select(i) in S1
– if i is a multiple of ½ log n, store select(i) in S2
– elements of S2 may become Θ(n) → Θ(log n) bits are
necessary to store each → Θ(n) bits for the entire S2
– select can be computed by binary search using the index
for rank, but it takes O(log n) time
• Divide B into large blocks, each of which contains
log² n ones.
• Use one of two types of data structures for each
large block
• If the length of a large block exceeds log^c n
– store the positions of its log² n ones explicitly
– log² n · log n = log³ n bits per large block
– the number of such large blocks is at most n/log^c n
– in total: (n/log^c n) · log³ n bits
– by letting c = 4, the index size is n/log n bits
• If the length m of a large block is at most log^c n
– divide it into small blocks of length ½ log n
– construct a complete √log n-ary tree storing the small
blocks in its leaves
– each node of the tree stores the number of ones in the
bit vector stored in its descendants
– the number of ones in a large block is log² n
→ each value fits in 2 log log n bits
(figure: such a tree of height O(c) over
B = 100101000100010011100000001, m ≤ log^c n;
the root stores 9, the total number of ones)
• The height of the tree is O(c)
• Space for storing the values in the nodes of one large
block is O(cm log log n/log n) bits
• Space for the whole vector is
O(cn log log n/log n) bits
• To compute select(i), search the tree from the root
– the information about the child nodes is represented in
√log n · 2 log log n = O(log n) bits → the child to visit is
determined by a table lookup in constant time
• Search time is O(c), i.e., constant
Theorem: rank and select on a bit-vector of length n are
computed in constant time on a word RAM with word
length Θ(log n) bits, using n + O(n log log n/log n) bits.
Note: rank inside a large block is computed using the
index for select, by traversing the tree from a leaf to
the root and summing the rank values.
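An earlier slide notes that select can also be answered by binary search over rank in O(log n) time; since the constant-time construction above is involved, here is a minimal sketch of that simpler alternative (function name is mine; `rank` is any function returning the number of ones in B[0..x]):

```python
def select_by_search(rank, n, i):
    """Position of the i-th one (i >= 1) in a vector of length n:
    the smallest x with rank(x) >= i, found by binary search,
    so O(log n) calls to rank."""
    lo, hi = 0, n - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if rank(mid) >= i:
            hi = mid        # the i-th one is at mid or earlier
        else:
            lo = mid + 1    # the i-th one is strictly after mid
    return lo
```

This works because rank(x) is nondecreasing in x, so "rank(x) ≥ i" is a monotone predicate.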
Extension of Rank/Select (1)
• Queries on 0 bits
– rank_c(B, x): number of c's in B[0..x] = B[0]B[1]…B[x]
– select_c(B, i): position of the i-th c in B (i ≥ 1)
• from rank0(B, x) = (x+1) - rank1(B, x),
rank0 is done in O(1) using no additional index
• select0 cannot be computed from the index for select1
→ add a similar index to that for select1
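The rank0 identity above in two lines (names are mine; `rank1` is a stand-in for any O(1) rank index):

```python
# Zeros in a prefix are just its length minus its ones,
# so rank0 needs no index of its own.
B = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]

def rank1(x):
    return sum(B[:x + 1])       # placeholder for the O(1) rank index

def rank0(x):
    return (x + 1) - rank1(x)   # rank0(B,x) = (x+1) - rank1(B,x)
```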
Extension of Rank/Select (2)
• Compression of sparse vectors
– a 0,1 vector of length n, having m ones
– lower bound of size:
log C(n, m) = m log(n/m) + (n-m) log(n/(n-m))
– if m << n:
log C(n, m) ≈ m log(n/m) + 1.44m
• Operations to be supported
– rank_c(B, x): number of c's in B[0..x] = B[0]B[1]…B[x]
– select_c(B, i): position of the i-th c in B (i ≥ 1)
Entropy of String
Definition: order-0 entropy H0 of string S
H0(S) = Σ_{c∈A} p_c log(1/p_c)
(p_c: probability of appearance of letter c)
Definition: order-k entropy
– assumption: Pr[S[i] = c] is determined from
S[i-k..i-1] (the context)
nH_k(S) = Σ_{s∈A^k} n_s Σ_{c∈A} p_{s,c} log(1/p_{s,c})
(example string: abcababc; e.g. for k = 2 the
context of the final c is ab)
– n_s: the number of letters whose context is s
– p_{s,c}: probability of c appearing in context s
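The two definitions above as executable formulas (a sketch with my own names; the empirical probabilities p_c and p_{s,c} are taken as relative frequencies in S):

```python
from collections import Counter
from math import log2

def H0(S):
    """Order-0 empirical entropy in bits/symbol: sum of p_c log(1/p_c)
    over the letters c occurring in S."""
    n = len(S)
    return sum((c / n) * log2(n / c) for c in Counter(S).values())

def Hk(S, k):
    """Order-k empirical entropy: (1/n) * sum over contexts s of
    n_s * H0(symbols that follow context s), per the slide's formula."""
    n = len(S)
    following = {}                          # context s -> list of next symbols
    for i in range(k, n):
        following.setdefault(S[i - k:i], []).append(S[i])
    return sum(len(f) * H0(f) for f in following.values()) / n
```

For a perfectly periodic string like "abababab", every symbol is determined by its length-1 context, so H1 is 0 even though H0 is 1 bit.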
• The information-theoretic lower bound for a
sparse vector asymptotically matches the order-0
entropy of the vector
log C(n, m) ≈ m log(n/m) + (n-m) log(n/(n-m))
            = n Σ_i p_i log(1/p_i)
            = nH0(B)
(with p_1 = m/n and p_0 = (n-m)/n)
Compressing Vectors (1)
• Divide vector into small blocks of length l = ½ log n
• Let mi denote number of 1’s in i-th small block Bi
– a small block is represented in
⌈log(l+1)⌉ + ⌈log C(l, m_i)⌉ bits
(#ones, plus which of the C(l, m_i) possible patterns
with the specified #ones)
• Total space for all blocks is
Σ_{i=0}^{n/l} (log(l+1) + log C(l, m_i))
= Σ_i log C(l, m_i) + (n/l) log(l+1)
≤ log C(n, m) + O(n log log n/log n) bits
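A sketch of this per-block encoding: a block is stored as its class m_i (the number of ones) plus an offset identifying which of the C(l, m_i) patterns of that class it is. Function names and the choice of lexicographic enumeration are mine; any fixed enumeration of each class works.

```python
from math import comb

def encode_block(block):
    """Return (class, offset): class = #ones m, offset = lexicographic
    rank of the block among all patterns of its length with m ones.
    Storing the pair takes about log(l+1) + log C(l, m) bits."""
    l, m = len(block), sum(block)
    offset, remaining = 0, m
    for j, bit in enumerate(block):
        if bit == 1:
            # all patterns with the same prefix and a 0 here are smaller;
            # they place the `remaining` ones in the l-j-1 later positions
            offset += comb(l - j - 1, remaining)
            remaining -= 1
    return m, offset

def decode_block(l, m, offset):
    """Inverse of encode_block: rebuild the block bit by bit."""
    block, remaining = [], m
    for j in range(l):
        c = comb(l - j - 1, remaining) if remaining > 0 else 0
        if remaining > 0 and offset >= c:
            block.append(1)       # skip the c smaller patterns with a 0 here
            offset -= c
            remaining -= 1
        else:
            block.append(0)
    return block
```

Since the offset ranges over 0 .. C(l, m)-1, it needs exactly ⌈log C(l, m)⌉ bits, which is what the space bound above counts.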
• indexes for rank, select are the same as those for
uncompressed vectors: O(n log log n /log n) bits
• numbers of 1’s in small blocks are already stored in
the index for select
• necessary to store pointers to small blocks
• let p_i denote the pointer to B_i
– 0 = p_0 < p_1 < … < p_{n/l} < n (less than n because compressed)
– p_i - p_{i-1} ≤ ½ log n
• if i is a multiple of log n, store p_i as it is (no comp.)
(n/(½ log² n)) · log n = 2n/log n bits
• otherwise, store p_i as the difference from the last
uncompressed pointer
(n/(½ log n)) · log(½ log² n) = O(n log log n/log n) bits
Theorem: A bit-vector of length n with m ones is
stored in log C(n, m) + O(n log log n/log n) bits, and
rank0, rank1, select0, select1 are computed in O(1) time
on a word RAM with word length Θ(log n) bits.
The data structure is constructed in O(n) time.
This data structure is called an FID (fully indexable
dictionary) (Raman, Raman, Rao [2]).
Note: if m << n, it can happen that
log C(n, m) << n log log n/log n, and then the
size of the index for rank/select cannot be ignored.
References
[1] G. Jacobson. Space-efficient Static Trees and Graphs. In Proc. IEEE
FOCS, pages 549–554, 1989.
[2] R. Raman, V. Raman, and S. S. Rao. Succinct Indexable Dictionaries
with Applications to Encoding k-ary Trees and Multisets, ACM
Transactions on Algorithms (TALG) , Vol. 3, Issue 4, 2007.
[3] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees
with Applications to Text Indexing and String Matching. SIAM
Journal on Computing, 35(2):378–407, 2005.
[4] R. Grossi, A. Gupta, and J. S. Vitter. High-Order Entropy-Compressed
Text Indexes. In Proc. ACM-SIAM SODA, pages 841–850, 2003.
[5] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo
Navarro. Compressed Representations of Sequences and Full-Text
Indexes. ACM Transactions on Algorithms (TALG) 3(2), article 20,
24 pages, 2007.
[6] Jérémy Barbay, Travis Gagie, Gonzalo Navarro, and Yakov Nekrich.
Alphabet Partitioning for Compressed Rank/Select and Applications.
Proc. ISAAC, pages 315-326. LNCS 6507, 2010.
[7] Alexander Golynski , J. Ian Munro , S. Srinivasa Rao, Rank/select
operations on large alphabets: a tool for text indexing, Proc. SODA,
pp. 368-373, 2006.
[8] Jérémy Barbay, Meng He, J. Ian Munro, S. Srinivasa Rao. Succinct
indexes for strings, binary relations and multi-labeled trees. Proc.
SODA, 2007.
[9] Dan E. Willard. Log-logarithmic worst-case range queries are
possible in space Θ(n). Information Processing Letters, 17(2):81–84,
1983.
[10] J. Ian Munro, Rajeev Raman, Venkatesh Raman, S. Srinivasa Rao.
Succinct representations of permutations. Proc. ICALP, pp. 345–356,
2003.
[11] R. Pagh. Low Redundancy in Static Dictionaries with Constant
Query Time. SIAM Journal on Computing, 31(2):353–363, 2001.