The FKS Perfect Hashing Scheme
Yossi Matias
Perfect hashing.
We consider the following perfect hashing problem: given a set S of n keys from a universe U,
build a look-up table T of size O(n) such that a membership query (given x ∈ U, is x ∈ S?)
can be answered in constant time.
We show that a perfect hash table can be built in linear expected time. The idea is to build
a two-level table (see Fig. 1). In the first level, a hash function f partitions the set S into n
subsets, denoted as buckets, B_1, B_2, ..., B_n. For a bucket B_i, we denote its size as b_i = |B_i|.
In the second level, each bucket B_i has a separate memory array whose size is Θ(b_i²), and a
separate hash function g_i that maps the bucket injectively into that memory array. All the
memory arrays are placed in a single table T, and for each bucket B_i we maintain the offset
p_i, which gives the position in T where B_i's memory array begins.
A high-level description of the algorithm is as follows:
Step 1 Find a function f : U → [1..n] that partitions S into buckets B_1, B_2, ..., B_n such
that Σ_{i=1}^n b_i² ≤ αn, where α is a constant that will be determined later.
Step 2 For each bucket B_i, compute an offset p_i = β Σ_{j=1}^{i−1} b_j², and allocate a subarray M_i of
size βb_i² in array T between positions p_i + 1 and p_{i+1}, where β is a constant that
will be determined later.
Step 3 For each bucket B_i find a function g_i : U → [1..βb_i²], such that g_i is injective on B_i.
For every key x ∈ B_i, place x in T[p_i + g_i(x)].
In Step 1, the function f is recorded. We use two additional arrays: P[1..n] to record the
offsets in Step 2, and G[1..n] to record the functions g_i in Step 3. The table T is of size
β Σ_{i=1}^n b_i² ≤ αβn, and the total memory required by the data structure is therefore O(n),
as required. Given a key x ∈ U, a membership query for x is supported in constant time as
follows:
1. Compute i = f(x).
2. Read g_i from G[i] and compute j = g_i(x).
3. If T[P[i] + j] = x then answer "x ∈ S", and otherwise answer "x ∉ S".
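The whole scheme, construction and query together, can be sketched in Python. This is an illustrative sketch, not the note's pseudocode verbatim: the hash family ((a·x + b) mod p) mod m over nonnegative integer keys, the prime P, and the parameters alpha = 4 and beta = 2 are assumptions chosen to match the analysis that follows.

```python
import random

P = (1 << 61) - 1  # a prime assumed larger than any key (assumption of this sketch)

def rand_hash(m):
    """Draw a random member of a universal family mapping integer keys to 0..m-1."""
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda x: ((a * x + b) % P) % m

def build_fks(S, alpha=4, beta=2):
    """Build the two-level table for a set S of distinct nonnegative integers."""
    n = len(S)
    # Step 1: retry f until sum of b_i^2 <= alpha * n (expected <= 2 tries).
    while True:
        f = rand_hash(n)
        buckets = [[] for _ in range(n)]
        for x in S:
            buckets[f(x)].append(x)
        if sum(len(B) ** 2 for B in buckets) <= alpha * n:
            break
    # Step 2: offsets by prefix sums; bucket i gets beta * b_i^2 slots
    # (at least one slot, so queries on empty buckets stay in bounds).
    sizes = [max(1, beta * len(B) ** 2) for B in buckets]
    p, total = [], 0
    for s in sizes:
        p.append(total)
        total += s
    T = [None] * total
    # Step 3: retry g_i until injective on B_i (expected <= 2 tries per bucket).
    G = []
    for i, B in enumerate(buckets):
        while True:
            g = rand_hash(sizes[i])
            if len({g(x) for x in B}) == len(B):
                break
        G.append(g)
        for x in B:
            T[p[i] + g(x)] = x
    return f, G, p, T

def member(x, f, G, p, T):
    """Constant-time membership query: i = f(x), j = g_i(x), one comparison."""
    i = f(x)
    return T[p[i] + G[i](x)] == x
```

The query touches exactly three arrays (G, P, T) plus one evaluation of f and one of g_i, which is the constant-time bound claimed above.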
More details and analysis:
In our analysis we will use four basic facts from probability theory, and a property of universal
hash functions:
1. Boole's inequality: For any sequence of events A_1, A_2, ..., A_m, m ≥ 1,
Pr(A_1 ∪ A_2 ∪ ⋯ ∪ A_m) ≤ Pr(A_1) + Pr(A_2) + ⋯ + Pr(A_m).
Data structures, Fall 1997
Dr. Yossi Matias
2. Markov's inequality: Let X be a nonnegative random variable, and suppose that E(X)
is well defined. Then for all t > 0, Pr(X ≥ t) ≤ E(X)/t. Alternatively, for all λ > 0,
Pr(X ≥ λE(X)) ≤ 1/λ.
3. Linearity of expectation: E(X + Y) = E(X) + E(Y); more generally,
E(Σ_{i=1}^n X_i) = Σ_{i=1}^n E(X_i).
4. Expectation in geometric-like distribution: Suppose that we have a sequence of Bernoulli
trials, each with a probability p of success and a probability 1 − p of failure. Then
the expected number of trials needed to obtain a success is at most 1/p.
5. Collisions in universal hash functions: If h is chosen from a universal collection of hash
functions and is used to hash N keys into a table of size B, the expected number of
collisions involving a particular key x is at most (N − 1)/B.
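Fact 5 presumes that such a universal collection exists. One standard example, given here as an illustrative assumption rather than part of the original note, is the Carter–Wegman family h_{a,b}(x) = ((ax + b) mod p) mod B for a prime p larger than every key:

```python
import random

P = (1 << 61) - 1  # a prime assumed to exceed every key in the universe U

def random_universal_hash(B):
    """Pick h uniformly from the family {h_{a,b}}, mapping keys into 0..B-1.
    For distinct keys x != y, the collision probability Pr[h(x) = h(y)] is at
    most roughly 1/B, which gives the expected collision bound of fact 5."""
    a = random.randrange(1, P)
    b = random.randrange(P)
    return lambda x: ((a * x + b) % P) % B
```

Each draw costs two random numbers and constant space, which is why Step 1a and Step 3b below take constant time.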
We can now provide more details on Step 1, which consists of the following sub-steps.
Step 1a Select at random a function f : U → [1..n] from a universal class of hash functions.
Step 1b Compute a hash table T′ with chaining using the hash function f, so that insertion
takes constant time.
Step 1c Compute an array B2, so that B2[i] = b_i².
Step 1d If Σ_{i=1}^n b_i² > αn then go to Step 1a; otherwise record the function f.
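Sub-steps 1a–1d amount to a rejection loop; a minimal sketch, assuming integer keys, α = 4, and the same illustrative hash family as above:

```python
import random
from collections import defaultdict

P = (1 << 61) - 1  # a prime assumed to exceed every key (illustrative choice)

def choose_first_level(S, alpha=4):
    """Steps 1a-1d: resample f until the sum of b_i^2 is at most alpha * n."""
    n = len(S)
    while True:
        a, b = random.randrange(1, P), random.randrange(P)    # Step 1a
        f = lambda x, a=a, b=b: ((a * x + b) % P) % n
        buckets = defaultdict(list)                           # Step 1b: chaining
        for x in S:
            buckets[f(x)].append(x)
        squares = sum(len(B) ** 2 for B in buckets.values())  # Step 1c
        if squares <= alpha * n:                              # Step 1d
            return f, buckets
```

Each pass is O(n), and the claim below shows the loop runs at most twice in expectation.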
Analysis Step 1a takes constant time, Step 1b takes O(n) time and O(n) space, Step 1c
takes O(n) time using the table T′, and Step 1d takes O(n) time, using array B2. The time
complexity, T_1, of Step 1 is therefore O(tn), where t is the number of iterations, i.e., the
number of functions f selected before the condition Σ_{i=1}^n b_i² ≤ αn is satisfied. The following
claim shows that for α ≥ 4 we have E(T_1) = O(n).
Claim: If α ≥ 4 then E(t) ≤ 2.
Proof. Let C_x be the number of collisions of a key x ∈ S under f; i.e., the number of y ∈ S,
y ≠ x, for which f(x) = f(y). Due to the collision property of universal hash functions (with
N = B = n) we have E(C_x) < 1.
We consider the total number of collisions C_S in S. Specifically, let C_S be the number of
(ordered) pairs ⟨x, y⟩, x, y ∈ S and x ≠ y, such that f(x) = f(y). Clearly, C_S = Σ_{x∈S} C_x.
Therefore, by linearity of expectation,
E(C_S) = Σ_{x∈S} E(C_x) < |S| · 1 = n.   (1)
On the other hand, we note that collisions are defined among keys mapped into the same
buckets, and can be counted as
C_S = Σ_{i=1}^n |{⟨x, y⟩ : x, y ∈ B_i, x ≠ y}| = Σ_{i=1}^n b_i(b_i − 1) = Σ_{i=1}^n b_i² − Σ_{i=1}^n b_i.
Therefore, since Σ_{i=1}^n b_i = n,
Σ_{i=1}^n b_i² = C_S + n,
and by Eq. (1),
E(Σ_{i=1}^n b_i²) = E(C_S) + n < 2n.
By Markov's inequality, applied to the random variable X = Σ_{i=1}^n b_i²,
Pr(Σ_{i=1}^n b_i² ≥ 4n) ≤ 1/2.
If α ≥ 4, then for a function f selected at random the condition Σ_{i=1}^n b_i² ≤ αn is satisfied
with probability at least 1/2. Therefore, the expected number, t, of functions f tried before
the condition is satisfied is at most 2.
To compute Step 2, note that p_i = p_{i−1} + βb_{i−1}² for i > 1, and p_1 = 0. Therefore, p_i can be
computed and recorded in array P by iterating for i = 1, ..., n. Step 2 takes T_2 = O(n) time.
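The prefix-sum recurrence for the offsets can be written directly; a small sketch, with β left as a parameter:

```python
def compute_offsets(bucket_sizes, beta=1):
    """Step 2: p_1 = 0 and p_i = p_{i-1} + beta * b_{i-1}^2, recorded in array P.
    bucket_sizes[i] is b_{i+1} (0-indexed); returns the list of offsets."""
    p = [0] * len(bucket_sizes)
    for i in range(1, len(bucket_sizes)):
        p[i] = p[i - 1] + beta * bucket_sizes[i - 1] ** 2
    return p
```

One pass over the n sizes, hence the O(n) bound for Step 2.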
Finally, Step 3 consists of the following sub-steps, executed for all i, i = 1, ..., n:
Step 3a Initialize the subarray T[P[i] + 1 .. P[i + 1]] to nil.
Step 3b Select at random a function g_i : U → [1..βb_i²] from a universal class of hash functions.
Step 3c For each x ∈ B_i, if T[P[i] + g_i(x)] is not nil then go to Step 3a (g_i is not injective
on B_i and a new g_i is to be selected); else write x into T[P[i] + g_i(x)].
Step 3d Record g_i in G[i].
Analysis We analyze first Step 3 for bucket B_i. Step 3a takes time O(b_i²). Step 3b takes
constant time. Step 3c can be implemented in O(b_i) time, using the i'th list in the hash table
T′ computed in Step 1. Step 3d takes constant time. The time complexity of Step 3 for bucket
B_i is therefore t_i = O(ℓ_i b_i²), where ℓ_i is the number of iterations, i.e., the number of functions
g_i selected before an injective function is found for B_i.
Comment: We could have each iteration take only O(b_i) time by removing Step 3a, initializing
the table T in Step 2, and modifying Step 3c as follows:
Step 3c′ For each x ∈ B_i, if T[P[i] + g_i(x)] is not nil then for all y ∈ B_i assign nil to
T[P[i] + g_i(y)] and go to Step 3b; else write x into T[P[i] + g_i(x)].
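For a single bucket, sub-steps 3b–3c reduce to a retry loop; a sketch, assuming the illustrative hash family from above and β = 2, and detecting collisions with a set rather than nil entries:

```python
import random

P = (1 << 61) - 1  # a prime assumed to exceed every key (illustrative choice)

def find_injective_hash(bucket, beta=2):
    """Steps 3b-3c: resample g_i until it is injective on the bucket.
    Returns (g_i, m), where m = beta * b_i^2 is the subarray size."""
    m = max(1, beta * len(bucket) ** 2)
    while True:
        a, b = random.randrange(1, P), random.randrange(P)  # Step 3b
        g = lambda x, a=a, b=b: ((a * x + b) % P) % m
        slots = {g(x) for x in bucket}                       # Step 3c
        if len(slots) == len(bucket):  # no two keys share a slot: injective
            return g, m
```

The claim below shows the loop body runs at most twice in expectation when β ≥ 2.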
The following claim shows that for β ≥ 2 we have E(t_i) = O(b_i²).
Claim: If β ≥ 2 then E(ℓ_i) ≤ 2.
Proof. Let C_x be the number of collisions of a key x in B_i under g_i; i.e., the number of y ∈ B_i,
y ≠ x, for which g_i(x) = g_i(y). Due to the collision property of universal hash functions (with
N = b_i and B = βb_i²) we have
E(C_x) < b_i/(βb_i²) = 1/(βb_i).
By Markov's inequality,
Pr(C_x ≥ 1) ≤ E(C_x) < 1/(βb_i).   (2)
Therefore, by Boole's inequality and Eq. (2), the probability that there are any collisions in B_i
is
Pr(∃x ∈ B_i such that C_x ≥ 1) ≤ b_i · (1/(βb_i)) = 1/β.
For β ≥ 2, the function g_i is injective with probability at least 1 − 1/β ≥ 1/2, and the expected
number of trials, ℓ_i, required before an injective function is found is at most 2.
For β ≥ 2 we have
E(t_i) = O(E(ℓ_i) · b_i²) = O(b_i²).
The total time, T_3, for Step 3 over all buckets is then
E(T_3) = Σ_{i=1}^n E(t_i) = O(Σ_{i=1}^n b_i²) = O(n).
The running time, T, of the entire algorithm can now be bounded as
E(T) = E(T_1 + T_2 + T_3) = E(T_1) + E(T_2) + E(T_3) = O(n).
Exercises
1. If h is chosen at random from an almost-universal collection of hash functions and is used
to hash N keys into a table of size B, the collision probability of any two particular keys
x and y is at most 2/B, and the expected number of collisions involving a particular key
x is at most 2(N − 1)/B.
Modify the algorithm above so that almost-universal functions are used instead of universal functions, and such that the expected running time remains O(n).
2. Modify the algorithm above and analyze it, so that the first-level function f maps the
input set S into 2n buckets, instead of n buckets.
3. (*) Generalizing (2), modify the algorithm above and analyze it, so that the first-level
function f maps the input set S into cn buckets for a constant c, and select the c that gives favorable
complexity (in terms of constants).
Fredman, Komlós, Szemerédi. Storing a sparse table with O(1) worst case access time, JACM 31, 1984, pp. 538–544.