Download Succinct Data Structures - the David R. Cheriton School of Computer

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Array data structure wikipedia , lookup

Lattice model (finance) wikipedia , lookup

Quadtree wikipedia , lookup

B-tree wikipedia , lookup

Red–black tree wikipedia , lookup

Interval tree wikipedia , lookup

Binary tree wikipedia , lookup

Binary search tree wikipedia , lookup

Transcript
Succinct Data Structures
Ian Munro
University of Waterloo
Joint work with David Benoit, Andrej Brodnik, D, Clark,
F. Fich, M. He, J. Horton, A. López-Ortiz, S.
Srinivasa Rao, Rajeev Raman, Venkatesh Raman,
Adam Storm …
How do we encode a large tree or other
combinatorial object of specialized
information
… even a static one
in a small amount of space
and still perform queries in constant time
???
Example of a Succinct Data
Structure:
The (Static) Bounded Subset
Given: Universe of n elements [0,...n-1]
and m arbitrary elements from this universe
Create: a static structure to support search
in constant time (lg n bit word and usual
operations)
Using: Essentially minimum possible # bits
n ))
... lg((m
Operation: Member query in O(1) time
(Brodnik & M.)
Focus on Trees
.. Because Computer Science is .. Arbophilic
- Directories (Unix, all the rest)
- Search trees (B-trees, binary search trees,
digital trees or tries)
- Graph structures (we do a tree based
search)
- Search indices for text (including DNA)
A Big Patricia Trie / Suffix Trie
0
1
0
1
100011



Given a large text file; treat it as bit vector
Construct a trie with leaves pointing to unique
locations in text that “match” path in trie (paths
must start at character boundaries)
Skip the nodes where there is no branching ( n-1
internal nodes)
Space for Trees
Abstract data type: binary tree
Size: n-1 internal nodes, n leaves
Operations: child, parent, subtree size, leaf
data
Motivation: “Obvious” representation of an n
node tree takes about 6 n lg n words (up,
left, right, size, memory manager, leaf
reference)
i.e. full suffix tree takes about 5 or 6 times
the space of suffix array (i.e. leaf
references only)
Succinct Representations of Trees
Start with Jacobson, then others:
There are about 4n/(πn)3/2 ordered rooted
trees, and same number of binary trees
Lower bound on specifying is about 2n bits
What are the natural representations?
Arbitrary Ordered Trees
Use parenthesis notation
 Represent the tree

As the binary string (((())())((())()())):
traverse tree as “(“ for node, then
subtrees, then “)”
 Each node takes 2 bits

Heap-like Notation for a Binary Tree
Add external nodes
Enumerate level by level
1
1
1
1
1
0
0
0
0
0
1
1
1
0
0
0
0
Store vector 11110111001000000 length2n+1
(Here don’t know size of subtrees; can be overcome.
Could use isomorphism to flip between notations)
How do we Navigate?
Jacobson’s key suggestion:
Operations on a bit vector
rank(x) = # 1’s up to & including x
select(x) = position of xth 1
So in the binary tree
leftchild(x) = 2 rank(x)
rightchild(x) = 2 rank(x) + 1
parent(x) = select(x/2)
Rank & Select
Rank -Auxiliary storage ~ 2nlglg n / lg n bits
#1’s up to each (lg n)2 rd bit
#1’s within these too each lg nth bit
Table lookup after that
Select -more complicated but similar notions
Key issue: Rank & Select take O(1) time
with lg n bit word (M. et al)
Aside: Interesting data type by itself
Other Combinatorial Objects
Planar Graphs (Lu et al)
Permutations [n]→ [n]
Or more generally
Functions [n] → [n]
But what are the operations?
Clearly π(i), but also π -1(i)
And then π k(i) and π -k(i)
Suffix Arrays (special permutations) in
linear space
Permutations: a Shortcut Notation
Let P be a simple array giving π; P[i] = π[i]
Also have B[i] be a pointer t positions back
in (the cycle of) the permutation;
B[i]= π-t[i] .. But only define B for every
tth position in cycle. (t is a constant;
ignore cycle length “round-off”)
2
4
5
13
1
8
3
12
10
So array representation
P = [8 4 12 5 13 x x 3 x 2 x 10 1]
1
2
3
4
5
6
7
8
9
10
11
12
13
Representing Shortcuts
In a cycle there is a B every t positions …
But these positions can be in arbitrary order
Which i’s have a B, and how do we store it?
Keep a vector of all positions
0 indicates no B 1 indicates a B
Rank gives the position of B[“i”] in B array
So: π(i) and π -1(i) in O(1) time & (1+ε)n lg n
bits
Theorem: Under a pointer machine model with
space (1+ ε) n references, we need time 1/ε
to answer π and π -1 queries; i.e. this is as
good as it gets.
Getting n lg n Bits: an Aside
This is the best we can do for O(1) operations
But using Benes networks:
1-Benes network is a 2 input/2 output switch
r+1-Benes network … join tops to tops
1
3
2
5
R-Benes Network
3
7
4
8
5
1
6
6
R-Benes Network
7
4
8
2
A Benes Network
Realizing the permutation
(3 5 7 8 1 6 4 2)
1
3
2
5
3
7
4
8
5
1
6
6
7
4
8
2
What can we do with it?
Divide into blocks of lg lg n gates … encode
their actions in a word. Taking advantage
of regularity of address mechanism
and also
Modify approach to avoid power of 2 issue
Can trace a path in time O(lg n/(lg lg n)
This is the best time we are able get for π
and π-1 in minimum space.
Observe: This method “violates” the pointer
machine lower bound by using
“micropointers”.
Back to the main track:
Powers of π
Consider the cycles of π
( 2 6 8)( 3 5 9 10)( 4 1 7)
Keep a bit vector to indicate the start of
each cycle
( 2 6 8 3 5 9 10 4 1 7)
Ignoring parentheses, view as new
permutation, ψ.
Note: ψ-1(i) is position containing i …
So we have ψ and ψ-1 as before
Use ψ-1(i) to find i, then bit vector (rank,
select) to find πk or π-k
Functions
Now consider arbitrary functions [n]→[n]
“A function is just a hairy permutation”
All tree edges lead to a cycle
Challenges here
Essentially write down the components in a
convenient order and use the n lg n bits to
describe the mapping (as per
permutations)
To get fk(i):
Find the level ancestor (k levels up) in a
tree
Or
Go up to root and apply f the remaining
number of steps around a cycle
Level Ancestors
There are several level ancestor techniques
using
O(1) time and O(n) WORDS.
Adapt Bender & Farach-Colton to work in
O(n) bits
But going the other way …
f-k is a set
Moving Down the tree requires
care
f-3( ) = ( )
The trick:
Report all nodes on a given
level of a tree in time
proportional to the number of
nodes, and
Don’t waste time on trees with
no answers
Final Function Result
Given an arbitrary function f: [n]→[n]
With an n lg n + O(n) bit representation we
can compute fk(i) in O(1) time and f-k(i) in
time O(1 + size of answer).
Back to Text … And Suffix Arrays
Text T[1..n] over (a,b)*# (a<#<b)
There are 2n-1 such texts, which of the n! suffix
arrays are valid?
1
2
3
4
5
6
7
8
7
5
1
8
3
6
2
b
8
b
6
a
1
a
3
b
7
a
2
#
5
M= 4
7
isn’t ..why?
1
5
8
2
3
6
SA= 4
is
a
SA-1= 4
Ascending to Max
M is a permutation so M-1 is its inverse
i.e. M-1[i] says where i is in M
Ascending-to-Max:  1  i  n-2
i)M-1[i] < M-1[n] and M-1[i+1] < M-1[n] 
M-1[i] < M-1[i+1]
ii)M-1[i] > M-1[n] and M-1[i+1] > M-1[n] 
M-1[i] > M-1[i+1]
4
4
7
7
5
1
1
5
8
8
3
2
6
3
2
6
OK
NO
Non-Nesting
Non-Nesting: 1  i,j  n-1 and M-1[i]<M-1[j]
i)M-1[i] < M-1[i+1] and M-1[j] < M-1[j+1] 
M-1[i+1] < M-1[j+1]
ii)M-1[i] > M-1[i+1] and M-1[j] > M-1[j+1] 
M-1[i+1] < M-1[j+1]
4
4
7
7
5
1
1
5
8
8
3
2
6
3
2
6
OK
NO
Characterization Theorem for Suffix
Arrays on Binary Texts
Theorem: Ascending to Max & Non-nesting
 Suffix Array
Corollary: Clean method of breaking SA into
segments
Corollary: Linear time algorithm to check
whether SA is valid
Cardinality Queries
T=
a b a a a b b a a a b a a b b#
Remember lengths longest run of a’s and of b’s
SA (broken by runs, but not stored explicitly)
8 3 | 9 4 12| 1 10 5 13|16 | 7 2 11 15|6 14
Ba,  bit vector .. If SA-1[i-1] in an “a” section store 1 in Ba,[SA-1[i]], else 0
Ba
0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1
Create rank structure on Ba, and similarly Bb, (Note these are reversed except at #)
Algorithm Count(T,P)
s ← 1; e ←n; i ← m;
while i>0 and se do
if P[i[=a then
s← rank1(Ba,s-1)+1; e←rank1(Ba,e)
else
s← na + 2 + rank1(Bb,s-1); e←na + 1 +rank1(Bb,e)
i ← i-1
Return max(e-s+1,0)
Time: O(length of query)
Listing Queries
Complex methods
Key idea: for queries of length at least d,
index every dth position .. For T and
forT(reversed)
So we have matches for T[i..n] and T[1,i-1]
View these as points in 2 space (Ferragina &
Manzini and Grossi & Vitter)
Do a range query (Alstrup et al)
Variety of results follow
General Conclusion
Interesting, and useful, combinatorial
objects can be:
Stored succinctly … O(lower bound) +o()
So that
Natural queries are performed in O(1) time
(or at least very close)
This can make the difference between using
them and not …