Generalized Hashing with
Variable-Length Bit Strings
Michael Klipper
Dan Blandford
Guy Blelloch
Original source:
D. Blandford and G. E. Blelloch. Storing Variable-Length Keys
in Arrays, Sets, and Dictionaries, with Applications. In
Symposium on Discrete Algorithms (SODA), 2005 (hopefully)
Hashing techniques
currently available
• Many hashing algorithms out there:
  • Separate chaining
  • Cuckoo hashing
  • FKS perfect hashing
• Also many hash functions designed,
including several universal families
• O(1) expected amortized time for updates,
and many have O(1) worst case time for
lookups
• They use Ω(n lg n) bits for n entries, since
at least lg n bits are used per entry to
distinguish between keys.
What kind of bounds do we achieve?
Let’s say we store n entries in our hashtable of
the form (s_i, t_i) for i = 0, 1, 2, …, n-1. Each
s_i and t_i is a bit string of variable length. For
our purposes, many of the t_i’s might only be a
few bits long.
Time for all operations (later slide):
O(1) expected amortized
Total space used:
O(Σ_i (max(|s_i| − lg n, 1) + |t_i|)) bits
The Improvement We Attain
Let’s say we store n entries taking up m total bits. In
terms of the s_i and t_i values on the previous slide,
m = Σ_i (|s_i| + |t_i|)
Note that m = Ω(n lg n).
Thus, our space usage is O(m – n lg n) bits, as
opposed to the Ω(m) bits that standard hashtable
structures use.
In particular, our structure is much more efficient than
standard structures when m is close to n lg n (for
example, when most entries are only a few bits long).
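To make the comparison concrete, here is a small sketch that evaluates both quantities on made-up data. The function names and the sample entries are my own assumptions, not part of the original work, and the second function only computes the expression inside the O(...), not an actual memory footprint.

```python
import math

def total_bits(entries):
    """m: the sum of |s_i| + |t_i| over all (key, data) bit-string pairs."""
    return sum(len(s) + len(t) for s, t in entries)

def bound_bits(entries):
    """The expression inside the O(...): sum of max(|s_i| - lg n, 1) + |t_i|."""
    lg_n = math.log2(len(entries))
    return sum(max(len(s) - lg_n, 1) + len(t) for s, t in entries)

# Hypothetical data: 8 distinct 4-bit keys, each carrying 1 bit of data.
entries = [(format(i, "04b"), "1") for i in range(8)]
print(total_bits(entries))   # 40
print(bound_bits(entries))   # 16.0
```

Here m = 40 and n lg n = 24, so the bound of 16 bits matches m – n lg n exactly; short keys like these are the case where the savings are largest.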
Generalized Dynamic Hashtables
We want to support the following operations:
• query(key, keyLength)
  • Looks up the key in the hashtable and
    returns the associated data and its length
• insert(key, keyLength, data, dataLength)
  • Adds (key, data) as an entry in the hashtable
• remove(key, keyLength)
  • Removes the key and the associated data
NOTE: Each key will only have one entry associated
with it. Another name for this kind of structure is a
variable-length dictionary structure.
Other Structures
• Variable-Length Sets
  • Also supports query, insert, and remove, though
    there is no extra data associated with keys
  • Can be easily implemented as a generalized
    hashtable that stores no extra data
  • O(1) expected amortized time for all operations
  • If the n keys are s0, s1, …, s_{n-1}, then the total
    space used in bits is
    O(Σ_i max(|s_i| − lg n, 1))
Other Structures (cont.)
• Variable-Length Arrays
  • For n entries, the keys are 0, 1, …, n-1.
  • These arrays will not be able to resize their
    number of entries.
  • Operations:
    • get(i) returns the data stored at index i
      and its length
    • set(i, val, len) updates the data at index
      i to val of length len
  • Once again, O(1) expected amortized time for
    operations. Total space usage is O(Σ_i |t_i|).
Implementation Note
Assume for now that we have a variable-length array structure as described on the
previous slide. We will use this to make
generalized dynamic hashtables, which are
more interesting than the arrays.
At the end of this presentation, I can talk
about implementation of variable-length
arrays if time permits.
The Main Idea Behind
How Hashtables Work
Our generalized hashtable structure contains a
variable-length array with 2^q entries (which
will serve as the buckets for the hashtable).
We keep 2^q approximately equal to n by
occasional rehashing of the bucket contents.
The item (s_i, t_i), where s_i is the key and t_i is
the data, is placed in a bucket as follows: we
first hash s_i to some index (more on this
later), and we write (s_i, t_i) into the bucket
specified by that index. Note that when we
hash s_i, we implicitly treat it as an integer.
Hashtables (cont.)
If several entries of the set collide in a bucket, we
throw them all into the bucket together as one
giant concatenated bit string. Thus, we
essentially use a separate-chaining algorithm.
To tell where one entry starts and another begins,
we encode the entries with a prefix-free code
(such as Huffman codes or gamma codes).
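As an illustration of how a prefix-free code lets us split a bucket back into its entries, here is a minimal sketch using a gamma code on the length of each string. The function names are hypothetical, and this string-based version makes no attempt at the O(1) table-lookup decoding discussed on the next slide; it only shows that the concatenation can be parsed unambiguously.

```python
def gamma(k):
    """Elias gamma code of a positive integer k, as a '0'/'1' string."""
    b = bin(k)[2:]
    return "0" * (len(b) - 1) + b

def encode(s):
    """Prefix-free encoding of bit string s: gamma(len(s) + 1), then s itself."""
    return gamma(len(s) + 1) + s

def decode_all(bucket):
    """Split a concatenation of encoded strings back into the original strings."""
    out, pos = [], 0
    while pos < len(bucket):
        zeros = 0
        while bucket[pos] == "0":        # unary part: number of length bits - 1
            zeros += 1
            pos += 1
        length = int(bucket[pos:pos + zeros + 1], 2) - 1
        pos += zeros + 1
        out.append(bucket[pos:pos + length])
        pos += length
    return out

bucket = "".join(encode(x) for x in ["101", "0", "", "1111"])
print(decode_all(bucket))  # ['101', '0', '', '1111']
```

Each string of length L costs about 2 lg L extra bits this way, so the encoded bucket stays within a constant factor of the raw data.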
[Figure: sample bucket holding the concatenation s1’ t1’ s2’ t2’ s3’ t3’, where s_i’ is s_i encoded, etc.]
Time and Space Bounds
Note that we use prefix-free codes that only use a constant
factor more space (i.e. they encode m bits in O(m) space)
and can be encoded/decoded in O(1) time.
Time: If we use a universal hash function to determine the
bucket index, then each bucket receives only a constant
expected number of elements, so it takes O(1) expected
amortized time to find an element in a bucket. The prefix-free
codes we use allow O(1) decoding of any element.
Space: The prefix-free codes increase the amount of bits
stored by at most a constant factor. If we have m bits total
we want to store, our space bound for variable-length arrays
says that the buckets take up O(m) bits.
There’s a bit more than that…
Recall the space bound for the hash table is
O(Σ_i (max(|s_i| − lg n, 1) + |t_i|)).
Where does the lg n savings per entry come from?
We perform a technique called quotienting.
We actually use two hash functions h’ and h’’. h’(s_i) is
the bucket index, and h’’(s_i) has length max(|s_i| − q,
1). (Recall that 2^q is approximately n.)
Instead of writing (s_i, t_i) in the bucket, we actually
write (h’’(s_i), t_i). This way, each entry needs |h’’(s_i)| +
|t_i| bits to write, which fulfills our space bound above.
A Quotienting Scheme
Let h0 be a hash function from a universal family whose
range is q bits. We describe a way to make a family of hash
functions from the family from which h0 is drawn.
Let s_i^t be the q most significant bits of s_i,
and let s_i^b be the other bits.
We define our hash functions as follows:
h’’(s_i) = s_i^b
h’(s_i) = h0(s_i^b) xor s_i^t
[Figure: example with q = 6 and s_i = 101101 001010100100101.
Here s_i^t = 101101 and s_i^b = h’’(s_i) = 001010100100101;
xoring h0(s_i^b) with s_i^t gives h’(s_i) = 111110.]
Undoing the Quotienting
In the previous example, we saw that h’(s_i) evaluated
to 111110, or 62. This means we store h’’(s_i) in
bucket number 62!
Note that given h’(s_i) and h’’(s_i) we can retrieve s_i:
s_i^b = h’’(s_i)
s_i^t = h0(h’’(s_i)) xor h’(s_i).
The family of h’ functions we make is another universal
family, so our time bound explained earlier still holds.
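The quotienting round trip can be sketched in a few lines. Everything named here is an assumption for illustration: h0 below is a simple multiply-add hash modulo a prime with fixed constants, standing in for whatever universal family is actually used, so the bucket number it produces will not match the 111110 of the slide's example. The sketch also assumes keys are at least Q bits long.

```python
Q = 6                      # bucket-index bits: the table has 2**Q buckets
P = (1 << 61) - 1          # a Mersenne prime (assumed parameter)
A, B = 1234567, 7654321    # fixed for repeatability; drawn at random in practice

def h0(bits):
    """Hypothetical stand-in for a universal hash with a Q-bit range."""
    x = int("1" + bits, 2)             # leading 1 keeps leading zeros significant
    return ((A * x + B) % P) % (1 << Q)

def split(s):
    """Return (top, bottom): the Q most significant bits and the remaining bits."""
    return s[:Q], s[Q:]

def h_bucket(s):
    """h': the bucket index, h0(bottom) xor top."""
    top, bottom = split(s)
    return h0(bottom) ^ int(top, 2)

def h_stored(s):
    """h'': the part of the key actually written into the bucket."""
    return split(s)[1]

def recover(bucket, stored):
    """Undo the quotienting: rebuild the key from the bucket index and stored bits."""
    top = h0(stored) ^ bucket
    return format(top, "0{}b".format(Q)) + stored

s = "101101" + "001010100100101"
print(h_stored(s))                               # 001010100100101
print(recover(h_bucket(s), h_stored(s)) == s)    # True
```

The point of the design is that the top Q bits are never stored: they are implied by which bucket the entry landed in.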
An Application of Hashtables:
Graph Structures
One area where we will be able to use the hashtable
structure is in storing graphs. Here, we describe a
semidynamic directed-graph implementation. This means
that the number of vertices is fixed, but edges can be added
or deleted at runtime.
Let u and v be vertices of a graph. We want the following
operations compactly and in O(1) expected amortized time:
• deg(v) - get the degree of vertex v
• adjacent(u, v) - returns true iff u and v are adjacent
• firstEdge(v) - returns the first neighbor of v in G
• nextEdge(u, v) - returns the next neighbor of u after
v (assumes u and v are adjacent)
• addEdge(u, v) - adds an edge from u to v in G
• deleteEdge(u, v) - deletes the edge (u, v) from G
Hashing Integers
Up to now, we have used bit strings as
the main objects in the hashtable. It
will also be useful to hash on integer
values. Hence, we have created some
utilities to convert between bit strings
and integers using as few bits as
possible, so an integer x takes basically
lg |x| bits to write as a bit string.
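One simple way such a conversion could look is a sign bit followed by the binary magnitude. This is only a guess at the flavor of the utilities, not the actual ones; the function names are hypothetical.

```python
def int_to_bits(x):
    """Sign bit followed by the binary magnitude: about lg|x| + 2 bits."""
    sign = "1" if x < 0 else "0"
    return sign + bin(abs(x))[2:]

def bits_to_int(b):
    """Inverse of int_to_bits."""
    value = int(b[1:], 2)
    return -value if b[0] == "1" else value

print(int_to_bits(5))    # 0101
print(int_to_bits(-3))   # 111
print(bits_to_int(int_to_bits(-3)))  # -3
```

Handling negative values matters later, when we store differences between vertex labels rather than the labels themselves.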
A Graph Layout Where We Store
Edges in a Hashtable
Let’s say u is a vertex of degree d and v1, …, v_d are its
neighbors. Let’s say that v0 = v_{d+1} = u by convention.
Then the entry representing the edge (u, v_i) has key (u, v_i)
and data (v_{i-1}, v_{i+1}).
[Figure: the hash table entries for vertex u. The extra entry with key (u, u) “starts” the list and also holds the degree of u.]
Implementations of a Couple Operations
For simplicity, I’m leaving off the length arguments in
query() and insert().
adjacent(u, v)
• return (query((u, v)) != -1);
firstEdge(u)
• let (vp, vn, d) = query((u, u));
• return vn;
addEdge(u, v)
• let (vp, vn, d) = query((u, u));
• remove((u, u));
• insert((u, u), (vp, v, d + 1));
• insert((u, v), (u, vn));
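Here is a runnable sketch of this layout, with a plain Python dict standing in for the variable-length hashtable. The class and method names are my own. Two things go beyond the pseudocode above and should be read as assumptions: the sketch also updates the predecessor pointer of the old first neighbor so the list stays doubly linked, and it assumes no self-loops and no duplicate edges.

```python
class Graph:
    def __init__(self, n):
        # (u, u) -> (last neighbor, first neighbor, degree); empty list points to u.
        self.table = {(u, u): (u, u, 0) for u in range(n)}

    def deg(self, u):
        return self.table[(u, u)][2]

    def adjacent(self, u, v):
        return u != v and (u, v) in self.table

    def first_edge(self, u):
        return self.table[(u, u)][1]

    def next_edge(self, u, v):
        return self.table[(u, v)][1]

    def add_edge(self, u, v):
        vp, vn, d = self.table[(u, u)]
        if vn == u:
            # The list was empty, so v becomes both first and last neighbor.
            self.table[(u, u)] = (v, v, d + 1)
        else:
            self.table[(u, u)] = (vp, v, d + 1)
            _, after = self.table[(u, vn)]
            self.table[(u, vn)] = (v, after)   # old first neighbor now follows v
        self.table[(u, v)] = (u, vn)           # v goes at the front of u's list

    def neighbors(self, u):
        v = self.first_edge(u)
        while v != u:                          # reaching u again ends the list
            yield v
            v = self.next_edge(u, v)

g = Graph(4)
for v in (1, 2, 3):
    g.add_edge(0, v)
print(list(g.neighbors(0)), g.deg(0))  # [3, 2, 1] 3
```

New edges go at the front of the list, which is why the neighbors come back in reverse insertion order.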
Compression and Space Usage
• Instead of ((u, v_i), (v_{i-1}, v_{i+1})) in the table,
we will store
((u, v_i – u), (v_{i-1} – u, v_{i+1} – u))
• With this representation, we need
O(Σ_{(u,v)∈E} lg |u – v|) space.
• A good labeling of the vertices will make
many of these differences small. For
instance, for many classes of graphs, such
as planar graphs, the total space used is
O(n) bits! The following paper has details:
D. Blandford, G. E. Blelloch, and I. Kash. Compact
Representations of Separable Graphs. In SODA, 2003, pages
More Details about
Implementing Arrays
We’ll use the following data for our example in
these slides:
t0 = 10110
t1 = 0110
t2 = 11111
t3 = 0101
t4 = 1100
t5 = 010
t6 = 11011
t7 = 00001111
We’ll assume that the word size is 2 bytes.
Key Idea: BLOCKS
• Multiple data items can be crammed into a
word, so let’s take advantage of that.
• There are many possible ways to store data
in blocks. The way that I’ll discuss here is
to use two words per block: one stores data
and one marks separation of entries.
1st word 1 0 1 1 0 0 1 1 0 1 1 1 1 1
2nd word 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
This is the block containing strings t0 through t2 from our example.
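A sketch of how such a two-word block could be built, using '0'/'1' strings in place of machine words. The function name and the zero padding of unused bits are my assumptions; the marker convention (a 1 at the start of each entry, plus a closing 1 when there is room) is read off the example above.

```python
W = 16  # word size in bits (2 bytes, as in the example)

def pack_block(entries):
    """Return (data_word, marker_word) for a list of nonempty bit strings."""
    data = "".join(entries)
    if len(data) > W:
        raise ValueError("entries do not fit in one word")
    # A 1 marks the first bit of each entry; the rest of the entry gets 0s.
    marks = "".join("1" + "0" * (len(e) - 1) for e in entries)
    if len(data) < W:
        marks += "1"                    # closes the last entry
    return data.ljust(W, "0"), marks.ljust(W, "0")

data, marks = pack_block(["10110", "0110", "11111"])
print(data)   # 1011001101111100
print(marks)  # 1000010001000010
```

The marker word printed here is the same bit pattern as the 2nd word shown for t0 through t2.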
Blocks: continued
We’ll name a block b_i if i is the first entry number to be
stored in that block. The size of a block is the sum of the
sizes of the entries inside it.
We’ll maintain a size invariant:
for any adjacent blocks b_i and b_j, |b_i| + |b_j| is at
least a full word.
Note: splitting and merging blocks is easy.
We assume these things for now:
• Entries fit into a word… we can handle longer entries
by storing a pointer to separate memory in their place
• Entries are nonempty
Organization of Blocks
We have a bit array A of length n (this is a regular
old C array). A[i] = 1 if and only if string #i starts a
block. This is our indexing structure.
We also have a standard hashtable H. If string #i
starts a block, H(i) = address of b_i. We assume
H is computed in O(1) expected amortized time.
Blocks are large enough that storing them in H only
increases the space usage by a constant factor.
In this example, b0 and b3 are adjacent blocks, as
are b3 and b7.
[Figure: b0 holds t0 t1 t2; b3 holds t3 t4 t5 t6; b7 holds t7.]
A Note about Space Usage
Any two 1’s in the indexing structure A
are separated by at most one word. This
is because entries are nonempty and a
block only holds one word for entries.
The get() operation
• Since bits that are turned on in A are close,
we can find the block to which an entry
belongs in O(1) time. One way to do this is
table lookup.
• If the i-th entry is in block b_k, then the i-th
entry of the array is the (i – k + 1)st entry in
that block.
• By using table lookup, we can find where the
correct 1’s in the second word are, which tell
us where the entry starts and ends.
A picture of the get() operation,
illustrated with get(2)
To find entry #2, we look in block #0.
10110 0110 11111
10000 1000 10000 10
Entry 2 is 5 bits long.
It is 11111.
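The lookup can be sketched end to end on the running example. This is an illustration under assumptions: words are '0'/'1' strings, H is a plain dict from a block's first index to its two words, and the short linear scans stand in for the table lookups that make the real operation O(1). The function names are hypothetical.

```python
W = 16  # word size in bits

def get_from_block(data, marks, j):
    """Return entry number j (0-based) of a block, given its two words."""
    starts = [p for p, bit in enumerate(marks) if bit == "1"]
    starts.append(W)   # sentinel: a completely full block has no closing 1
    return data[starts[j]:starts[j + 1]]

def get(A, H, i):
    """Find the block holding entry i via A, then extract the entry."""
    k = i
    while A[k] == 0:   # walk back to the nearest block start
        k -= 1
    data, marks = H[k]
    return get_from_block(data, marks, i - k)

# The example array: blocks b0 = t0..t2, b3 = t3..t6, b7 = t7.
A = [1, 0, 0, 1, 0, 0, 0, 1]
H = {
    0: ("1011001101111100", "1000010001000010"),
    3: ("0101110001011011", "1000100010010000"),
    7: ("0000111100000000", "1000000010000000"),
}
print(get(A, H, 2))  # 11111
print(get(A, H, 6))  # 11011
```

Entry 2 sits in block 0 as its third entry, which is the (i – k + 1)st entry with i = 2 and k = 0.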
How set() works in a nutshell
1) Find the block with the entry.
2) Rewrite it.
3) If the block is too large, split it into smaller blocks.
4) Merge adjacent blocks together to
preserve the size invariant.
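The four steps can be sketched on a simplified representation where each block is a Python list of bit strings. This is only a sketch under assumptions: the greedy split rule and the function names are mine, the code rescans every block instead of touching just the affected neighbors, and it leaves out the bookkeeping for A and H.

```python
W = 16  # word size in bits, as in the running example

def size(block):
    return sum(len(e) for e in block)

def set_entry(blocks, i, val):
    """Set entry i to val and return a block list satisfying the size invariant."""
    # 1) Find the block with the entry, and 2) rewrite it (in place).
    first = 0
    for block in blocks:
        if i < first + len(block):
            block[i - first] = val
            break
        first += len(block)
    # 3) Split any block that no longer fits in one word (greedy fill).
    pieces = []
    for block in blocks:
        cur = []
        for e in block:
            if cur and size(cur) + len(e) > W:
                pieces.append(cur)
                cur = []
            cur.append(e)
        pieces.append(cur)
    # 4) Merge neighbors whose combined size is under a full word.
    merged = []
    for block in pieces:
        if merged and size(merged[-1]) + size(block) < W:
            merged[-1] = merged[-1] + block
        else:
            merged.append(block)
    return merged

blocks = [["10110", "0110"], ["11111", "0101"]]
print(set_entry(blocks, 2, "1"))  # [['10110', '0110', '1', '0101']]
```

In the usage line, shrinking entry 2 drops the two blocks to 9 + 5 = 14 bits, below a full word, so they are merged into one.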
Now, to prove the theorem about
space usage for arrays
• Let m = Σ_i |t_i| and w = machine word size. I
claim the total number of bits used is O(m).
• Our size invariant for blocks guarantees that
on average, blocks are half full. Thus, there
are O(m / w) blocks used, since there are m
bits total of data and each block has Ω(w)
bits stored in it on average.
• Our indexing structure A and hashtable H use
O(w) bits per block (O(1) words). Total bits:
O(m / w) blocks * O(w) per block = O(m) bits.
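The counting step can be spelled out a little more. Writing B for the number of blocks (a symbol introduced here for illustration), pair up consecutive blocks; by the size invariant each pair holds at least w bits of data. The bit array A has n bits, and n ≤ m because entries are nonempty.

```latex
\begin{align*}
m &\ge \left\lfloor \tfrac{B}{2} \right\rfloor \cdot w
  && \text{each of the } \lfloor B/2 \rfloor \text{ disjoint adjacent pairs holds at least } w \text{ bits} \\
B &\le \frac{2m}{w} + 1 \\
\text{total bits} &= B \cdot O(w) + n = O(m + w) + n = O(m + w)
\end{align*}
```

So the bound is O(m) as soon as the array holds at least one word of data.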
A note about entries
longer than w bits
What is really done in our code with entries longer
than w bits is not just allocating separate memory
and putting a pointer in the array, though it’s close.
We do essentially what standard structures do,
and we chain the words making up our entry into
a linked list. We have a clever way to do this
which doesn’t need to use w-bit pointers; instead
we only need 7 or 8 bits for a pointer.