Succinct Data Structures
Kunihiko Sadakane
National Institute of Informatics
Succinct Data Structures
• Succinct data structures = succinct representation of
data + succinct index
• Examples
– sets
– trees, graphs
– strings
– permutations, functions
Succinct Representation
• A representation of data whose size (roughly)
matches the information-theoretic lower bound
• If the input is taken from L distinct ones, its
information-theoretic lower bound is log L bits
(base of logarithm is 2)
• Example 1: a lower bound for a set S ⊆ {1,2,...,n}
– log 2^n = n bits
– e.g. n = 3: the 2^3 = 8 subsets are
{}, {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}
• Example 2: n-node ordered tree
– log [ (1/(2n+1)) C(2n+1, n+1) ] = 2n - Θ(log n) bits
(e.g. for n = 4 there are 5 ordered trees)
• Example 3: length-n string
– n log σ bits (σ: alphabet size)
• For any data, there exists its succinct representation
– enumerate all in some order, and assign codes
– can be represented in ⌈log L⌉ bits
– not clear if it supports efficient queries
• Query time depends on how data are represented
– necessary to use appropriate representations
(figure: the 5 ordered trees with n = 4 nodes,
assigned codes 000, 001, 010, 011, 100;
⌈log 5⌉ = 3 bits each)
Succinct Indexes
• Auxiliary data structure to support queries
• Size: o(log L) bits
• (Almost) the same time complexity as using
conventional data structures
• Computation model: word RAM
– assume word length w = Θ(log log L)
(same pointer size as conventional data structures)
word RAM
• A word RAM with word length w bits supports
– reading/writing w bits of memory at an arbitrary address
in constant time
– arithmetic/logical operations on two w-bit numbers,
done in constant time
– arithmetic ops.: +, -, *, /, log (most significant bit)
– logical ops.: and, or, not, shift
• These operations can be done in constant time using
O(2^(εw))-bit tables (ε > 0 is an arbitrary constant)
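A minimal sketch (not from the slides) of the table-lookup idea just mentioned: a word operation is answered in constant time by looking up precomputed answers for half-words. The operation, word size, and names below are my own choices for illustration.

```python
# Hedged sketch: popcount (number of 1 bits) on a w = 16-bit word
# via two lookups in a table of 2^(w/2) = 256 entries.
W = 16
HALF = W // 2
MASK = (1 << HALF) - 1
# table[p] = number of 1 bits in the 8-bit pattern p
TABLE = [bin(p).count("1") for p in range(1 << HALF)]

def popcount16(x):
    """Number of 1 bits in a 16-bit word: two table lookups, no bit scan."""
    return TABLE[x & MASK] + TABLE[x >> HALF]
```

The same trick with tables of size 2^(εw) underlies the constant-time in-block rank queries on the following slides.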
1. Bit Vectors
• B: a 0,1 vector of length n, B[0]B[1]…B[n-1]
• lower bound of size = log 2^n = n bits
• queries
– rank(B, x): number of ones in B[0..x] = B[0]B[1]…B[x]
– select(B, i): position of the i-th 1 from the head (i ≥ 1)
• basics of all succinct data structures
• naïve data structure
– store all the answers in arrays
– 2n words (2n log n bits)
– O(1) time queries
(example: B = 1001010001000000, n = 16;
the first ones are at positions 0, 3, 5, 9)
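A minimal sketch of this naïve structure, using the slide's example vector (array names are mine): every answer is precomputed, so both queries are O(1) at the cost of 2n words.

```python
# Naive rank/select: store all the answers in two arrays (2n words).
B = [1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0]  # the slide's example, n = 16

rank_answers = []            # rank_answers[x] = number of ones in B[0..x]
ones = 0
for bit in B:
    ones += bit
    rank_answers.append(ones)

select_answers = [pos for pos, bit in enumerate(B) if bit == 1]

def rank(x):                 # O(1) time
    return rank_answers[x]

def select(i):               # O(1) time, i >= 1
    return select_answers[i - 1]
```

For this vector the first four ones are at positions 0, 3, 5, 9, matching the figure.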
Succinct Index for Rank: 1
• Divide B into blocks of length log² n
• Store rank(x) at block boundaries in array R
rank(x) = Σ_{i=1}^{x} B[i]
        = R[⌊x/log² n⌋] + Σ_{i=⌊x/log² n⌋·log² n + 1}^{x} B[i]
• Size of R: (n/log² n) · log n = n/log n bits
• rank(x): O(log² n) time
Succinct Index for Rank: 2
• Divide each block into small blocks of length ½ log n
• Store rank(x) at small-block boundaries in R2,
relative to the enclosing block
rank(x) = R1[⌊x/log² n⌋] + R2[⌊x/(½ log n)⌋]
          + Σ_{i=⌊x/(½ log n)⌋·(½ log n) + 1}^{x} B[i]
• Size of R2: each entry is at most log² n, so
(n/(½ log n)) · log(log² n) = 4n log log n/log n bits
• rank(x): O(log n) time
Succinct Index for Rank: 3
• Compute answers to all possible queries on small
blocks in advance, and store them in a table
(indexed by a ½ log n-bit pattern and a position)
• Size of table:
2^{½ log n} · ½ log n · log(½ log n)
= O(√n log n log log n) = o(n log log n/log n) bits
pattern:  000 001 010 011 100 101 110 111
rank(0):   0   0   0   0   1   1   1   1
rank(1):   0   0   1   1   1   1   2   2
rank(2):   0   1   1   2   1   2   2   3
Theorem: rank on a bit-vector of length n is computed
in constant time on a word RAM with word length
Θ(log n) bits, using n + O(n log log n/log n) bits.
Note: The size of the table for computing rank inside a
small block is O(√n log n log log n) bits. This can be
ignored theoretically, but not in practice.
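A sketch of the whole three-part rank construction above (block sizes and names are my own; the in-block pattern extraction is written as a short Python loop where the real structure would use a single word operation, so only the structure, not the constant-time claim, is reproduced):

```python
import math

def build_rank(B):
    """Rank index: big blocks of length log^2 n (array R1), small
    blocks of length ~1/2 log n (array R2, relative to the big block),
    and one shared lookup table for in-block popcounts."""
    n = len(B)
    lg = max(2, int(math.log2(n)))
    big, small = lg * lg, lg // 2
    R1 = [0] * (n // big + 2)    # absolute rank at big-block boundaries
    R2 = [0] * (n // small + 2)  # rank relative to the enclosing big block
    ones = 0
    for i in range(n + 1):
        if i % big == 0:
            R1[i // big] = ones
        if i % small == 0:
            R2[i // small] = ones - R1[i // big]
        if i < n:
            ones += B[i]
    # one table shared by all small blocks: popcount of each pattern
    table = [bin(p).count("1") for p in range(1 << small)]

    def rank(x):  # number of ones in B[0..x]
        j = (x + 1) // small              # small blocks fully before x+1
        pat = 0
        for bit in B[j * small : x + 1]:  # < small bits; a word op in
            pat = (pat << 1) | bit        # the real structure
        return R1[(j * small) // big] + R2[j] + table[pat]

    return rank
```

The three addends correspond exactly to R1, R2, and the table lookup on the preceding slides.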
Succinct Index for Select
• More difficult than rank
– Let i = q·(log² n) + r·(½ log n) + s
– if i is a multiple of log² n, store select(i) in S1
– if i is a multiple of ½ log n, store select(i) in S2
– elements of S2 may become Θ(n) → Θ(log n) bits are
necessary to store each → Θ(n) bits for the entire S2
– select can be computed by binary search using the index
for rank, but it takes O(log n) time
• Divide B into large blocks, each of which contains
log² n ones.
• Use one of two types of data structures for each
large block
• If the length of a large block exceeds log^c n
– store the positions of its log² n ones explicitly
– log² n · log n = log³ n bits per large block
– the number of such large blocks is at most n/log^c n
– in total: (n/log^c n) · log³ n bits
– by letting c = 4, the index size is n/log n bits
• If the length m of a large block is at most log^c n
– divide it into small blocks of length ½ log n
– construct a complete √log n-ary tree storing the small
blocks in its leaves
– each node of the tree stores the number of ones in the
bit vector stored in its descendants
– the number of ones in a large block is log² n
→ each value fits in 2 log log n bits
(figure: such a tree of height O(c) over
B = 100101000100010011100000001, m ≤ log^c n;
the root stores 9, the total number of ones)
• The height of the tree is O(c)
• Space for storing the values in the nodes of one large
block is O(cm log log n/log n) bits
• Space for the whole vector is
O(cn log log n/log n) bits
• To compute select(i), search the tree from the root
– the information about the child nodes is represented in
√log n · 2 log log n = O(log n) bits → the child to visit is
determined by a table lookup in constant time
• Search time is O(c), i.e., constant
Theorem: rank and select on a bit-vector of length n are
computed in constant time on a word RAM with word
length Θ(log n) bits, using n + O(n log log n/log n) bits.
Note: rank inside a large block is computed using the
index for select, by traversing the tree from a leaf to
the root and summing the rank values.
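An earlier slide notes that select can also be answered by binary search over rank in O(log n) time; since the constant-time construction above is involved, here is a minimal sketch of that simpler alternative (function name is mine; `rank` is any function returning the number of ones in B[0..x]):

```python
def select_by_search(rank, n, i):
    """Position of the i-th one (i >= 1) in a vector of length n:
    the smallest x with rank(x) >= i, found by binary search,
    so O(log n) calls to rank."""
    lo, hi = 0, n - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if rank(mid) >= i:
            hi = mid        # the i-th one is at mid or earlier
        else:
            lo = mid + 1    # the i-th one is strictly after mid
    return lo
```

This works because rank(x) is nondecreasing in x, so "rank(x) ≥ i" is a monotone predicate.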
Extension of Rank/Select (1)
• Queries on 0 bits
– rank_c(B, x): number of c's in B[0..x] = B[0]B[1]…B[x]
– select_c(B, i): position of the i-th c in B (i ≥ 1)
• from rank0(B, x) = (x+1) - rank1(B, x),
rank0 is done in O(1) using no additional index
• select0 cannot be computed from the index for select1
→ add a similar index to that for select1
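The rank0 identity above in two lines (names are mine; `rank1` is a stand-in for any O(1) rank index):

```python
# Zeros in a prefix are just its length minus its ones,
# so rank0 needs no index of its own.
B = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]

def rank1(x):
    return sum(B[:x + 1])       # placeholder for the O(1) rank index

def rank0(x):
    return (x + 1) - rank1(x)   # rank0(B,x) = (x+1) - rank1(B,x)
```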
Extension of Rank/Select (2)
• Compression of sparse vectors
– a 0,1 vector of length n, having m ones
– lower bound of size:
log C(n, m) = m log(n/m) + (n-m) log(n/(n-m))
– if m << n:
log C(n, m) ≈ m log(n/m) + 1.44m
• Operations to be supported
– rank_c(B, x): number of c's in B[0..x] = B[0]B[1]…B[x]
– select_c(B, i): position of the i-th c in B (i ≥ 1)
Entropy of String
Definition: order-0 entropy H0 of string S
H0(S) = Σ_{c∈A} p_c log(1/p_c)
(p_c: probability of appearance of letter c)
Definition: order-k entropy
– assumption: Pr[S[i] = c] is determined from
S[i-k..i-1] (the context)
nH_k(S) = Σ_{s∈A^k} n_s Σ_{c∈A} p_{s,c} log(1/p_{s,c})
(example string: abcababc; e.g. for k = 2 the
context of the final c is ab)
– n_s: the number of letters whose context is s
– p_{s,c}: probability of c appearing in context s
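The two definitions above as executable formulas (a sketch with my own names; the empirical probabilities p_c and p_{s,c} are taken as relative frequencies in S):

```python
from collections import Counter
from math import log2

def H0(S):
    """Order-0 empirical entropy in bits/symbol: sum of p_c log(1/p_c)
    over the letters c occurring in S."""
    n = len(S)
    return sum((c / n) * log2(n / c) for c in Counter(S).values())

def Hk(S, k):
    """Order-k empirical entropy: (1/n) * sum over contexts s of
    n_s * H0(symbols that follow context s), per the slide's formula."""
    n = len(S)
    following = {}                          # context s -> list of next symbols
    for i in range(k, n):
        following.setdefault(S[i - k:i], []).append(S[i])
    return sum(len(f) * H0(f) for f in following.values()) / n
```

For a perfectly periodic string like "abababab", every symbol is determined by its length-1 context, so H1 is 0 even though H0 is 1 bit.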
• The information-theoretic lower bound for a
sparse vector asymptotically matches the order-0
entropy of the vector
log C(n, m) ≈ m log(n/m) + (n-m) log(n/(n-m))
            = n Σ_i p_i log(1/p_i)
            = nH0(B)
(with p_1 = m/n and p_0 = (n-m)/n)
Compressing Vectors (1)
• Divide vector into small blocks of length l = ½ log n
• Let mi denote number of 1’s in i-th small block Bi
– a small block is represented in
⌈log(l+1)⌉ + ⌈log C(l, m_i)⌉ bits
(#ones, plus which of the C(l, m_i) possible patterns
with the specified #ones)
• Total space for all blocks is
Σ_{i=0}^{n/l} (log(l+1) + log C(l, m_i))
= Σ_i log C(l, m_i) + (n/l) log(l+1)
≤ log C(n, m) + O(n log log n/log n) bits
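A sketch of this per-block encoding: a block is stored as its class m_i (the number of ones) plus an offset identifying which of the C(l, m_i) patterns of that class it is. Function names and the choice of lexicographic enumeration are mine; any fixed enumeration of each class works.

```python
from math import comb

def encode_block(block):
    """Return (class, offset): class = #ones m, offset = lexicographic
    rank of the block among all patterns of its length with m ones.
    Storing the pair takes about log(l+1) + log C(l, m) bits."""
    l, m = len(block), sum(block)
    offset, remaining = 0, m
    for j, bit in enumerate(block):
        if bit == 1:
            # all patterns with the same prefix and a 0 here are smaller;
            # they place the `remaining` ones in the l-j-1 later positions
            offset += comb(l - j - 1, remaining)
            remaining -= 1
    return m, offset

def decode_block(l, m, offset):
    """Inverse of encode_block: rebuild the block bit by bit."""
    block, remaining = [], m
    for j in range(l):
        c = comb(l - j - 1, remaining) if remaining > 0 else 0
        if remaining > 0 and offset >= c:
            block.append(1)       # skip the c smaller patterns with a 0 here
            offset -= c
            remaining -= 1
        else:
            block.append(0)
    return block
```

Since the offset ranges over 0 .. C(l, m)-1, it needs exactly ⌈log C(l, m)⌉ bits, which is what the space bound above counts.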
• indexes for rank, select are the same as those for
uncompressed vectors: O(n log log n /log n) bits
• numbers of 1’s in small blocks are already stored in
the index for select
• necessary to store pointers to small blocks
• let p_i denote the pointer to B_i
– 0 = p_0 < p_1 < … < p_{n/l} < n (less than n because compressed)
– p_i - p_{i-1} ≤ ½ log n
• if i is a multiple of log n, store p_i as it is (no comp.)
(n/(½ log² n)) · log n = 2n/log n bits
• otherwise, store p_i as the difference from the last
uncompressed pointer
(n/(½ log n)) · log(½ log² n) = O(n log log n/log n) bits
Theorem: A bit-vector of length n with m ones is
stored in log C(n, m) + O(n log log n/log n) bits, and
rank0, rank1, select0, select1 are computed in O(1) time
on a word RAM with word length Θ(log n) bits.
The data structure is constructed in O(n) time.
This data structure is called an FID (fully indexable
dictionary) (Raman, Raman, Rao [2]).
Note: if m << n, it can happen that
log C(n, m) << n log log n/log n, and then the
size of the index for rank/select cannot be ignored.
References
[1] G. Jacobson. Space-efficient Static Trees and Graphs. In Proc. IEEE
FOCS, pages 549–554, 1989.
[2] R. Raman, V. Raman, and S. S. Rao. Succinct Indexable Dictionaries
with Applications to Encoding k-ary Trees and Multisets, ACM
Transactions on Algorithms (TALG) , Vol. 3, Issue 4, 2007.
[3] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees
with Applications to Text Indexing and String Matching. SIAM
Journal on Computing, 35(2):378–407, 2005.
[4] R. Grossi, A. Gupta, and J. S. Vitter. High-Order Entropy-Compressed
Text Indexes. In Proc. ACM-SIAM SODA, pages 841–850, 2003.
[5] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo
Navarro. Compressed Representations of Sequences and Full-Text
Indexes. ACM Transactions on Algorithms (TALG) 3(2), article 20,
24 pages, 2007.
[6] Jérémy Barbay, Travis Gagie, Gonzalo Navarro, and Yakov Nekrich.
Alphabet Partitioning for Compressed Rank/Select and Applications.
Proc. ISAAC, pages 315-326. LNCS 6507, 2010.
[7] Alexander Golynski , J. Ian Munro , S. Srinivasa Rao, Rank/select
operations on large alphabets: a tool for text indexing, Proc. SODA,
pp. 368-373, 2006.
[8] Jérémy Barbay, Meng He, J. Ian Munro, S. Srinivasa Rao. Succinct
indexes for strings, binary relations and multi-labeled trees. Proc.
SODA, 2007.
[9] Dan E. Willard. Log-logarithmic worst-case range queries are
possible in space Θ(n). Information Processing Letters, 17(2):81–84,
1983.
[10] J. Ian Munro, Rajeev Raman, Venkatesh Raman, S. Srinivasa Rao.
Succinct representations of permutations. Proc. ICALP, pp. 345–356,
2003.
[11] R. Pagh. Low Redundancy in Static Dictionaries with Constant
Query Time. SIAM Journal on Computing, 31(2):353–363, 2001.