The FKS Perfect Hashing Scheme
Yossi Matias

Perfect hashing. We consider the following perfect hashing problem: given a set $S$ of $n$ keys from a universe $U$, build a look-up table $T$ of size $O(n)$ such that a membership query (given $x \in U$, is $x \in S$?) can be answered in constant time. We show that a perfect hash table can be built in linear expected time.

The idea is to build a two-level table (see Fig. 1). In the first level, a hash function $f$ partitions the set $S$ into $n$ subsets, called buckets, $B_1, B_2, \ldots, B_n$. For a bucket $B_i$, we denote its size by $b_i = |B_i|$. In the second level, each bucket $B_i$ has a separate memory array of size $\Theta(b_i^2)$, and a separate hash function $g_i$ that maps the bucket injectively into that memory array. All the memory arrays are placed in a single table $T$, and for each bucket $B_i$ we maintain an offset $p_i$, which gives the position in $T$ where $B_i$'s memory array begins.

A high-level description of the algorithm is as follows:

Step 1 Find a function $f : U \to [1..n]$ that partitions $S$ into buckets $B_1, B_2, \ldots, B_n$ such that $\sum_{i=1}^n b_i^2 \le \alpha n$, where $\alpha$ is a constant that will be determined later.

Step 2 For each bucket $B_i$, compute an offset $p_i = \beta \sum_{j=1}^{i-1} b_j^2$, and allocate a subarray $M_i$ of size $\beta b_i^2$ in array $T$, between positions $p_i + 1$ and $p_{i+1}$, where $\beta$ is a constant that will be determined later.

Step 3 For each bucket $B_i$, find a function $g_i : U \to [1..\beta b_i^2]$ such that $g_i$ is injective on $B_i$. For every key $x \in B_i$, place $x$ in $T[p_i + g_i(x)]$.

In Step 1 the function $f$ is recorded. We use two additional arrays: $P[1..n]$ to record the offsets in Step 2, and $G[1..n]$ to record the functions $g_i$ in Step 3. The table $T$ is of size $\beta \sum_{i=1}^n b_i^2 \le \alpha\beta n$, and the total memory required by the data structure is therefore $O(n)$, as required.

Given a key $x \in U$, a membership query for $x$ is supported in constant time as follows:

1. Compute $i = f(x)$.
2. Read $g_i$ from $G[i]$ and compute $j = g_i(x)$.
3. If $T[P[i] + j] = x$ then answer "$x \in S$", and otherwise answer "$x \notin S$".
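To make the layout concrete, here is a minimal Python sketch of the query procedure over a hand-built instance. The toy key set, the 0-based indexing (the text uses $1..n$), and the trivial modular first-level function are illustrative assumptions, not part of the scheme itself:

```python
# Hand-built two-level table for the toy set S = {3, 14, 25} with n = 3
# buckets (0-based here).  Each bucket happens to hold exactly one key,
# so every second-level subarray has size b_i^2 = 1.
f = lambda x: x % 3          # illustrative first-level function
G = [lambda x: 0] * 3        # trivial injective g_i for singleton buckets
P = [0, 1, 2]                # offsets: p_i = sum of earlier subarray sizes
T = [3, 25, 14]              # single flat table holding all subarrays

def query(x):
    """Constant-time membership query, following steps 1-3 above."""
    i = f(x)                 # 1. compute the bucket index
    j = G[i](x)              # 2. position within bucket i's subarray
    return T[P[i] + j] == x  # 3. one probe into T decides membership
```

Note that a query costs one evaluation of $f$, one of $g_i$, and one probe into $T$, regardless of $n$.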
Data structures, Fall 1997, Dr. Yossi Matias

More details and analysis: In our analysis we will use four basic facts from probability theory, and a property of universal hash functions:

1. Boole's inequality: For any sequence of events $A_1, A_2, \ldots, A_m$, $m \ge 1$, $\Pr(A_1 \cup A_2 \cup \cdots \cup A_m) \le \Pr(A_1) + \Pr(A_2) + \cdots + \Pr(A_m)$.

2. Markov's inequality: Let $X$ be a nonnegative random variable, and suppose that $E(X)$ is well defined. Then for all $t > 0$, $\Pr(X \ge t) \le E(X)/t$. Alternatively, for all $\lambda > 0$, $\Pr(X \ge \lambda E(X)) \le 1/\lambda$.

3. Linearity of expectation: $E(X + Y) = E(X) + E(Y)$; more generally, $E(\sum_{i=1}^n X_i) = \sum_{i=1}^n E(X_i)$.

4. Expectation in a geometric-like distribution: Suppose that we have a sequence of Bernoulli trials, each with probability $p$ of success and probability $1 - p$ of failure. Then the expected number of trials needed to obtain a success is at most $1/p$.

5. Collisions in universal hash functions: If $h$ is chosen from a universal collection of hash functions and is used to hash $N$ keys into a table of size $B$, the expected number of collisions involving a particular key $x$ is at most $(N - 1)/B$.

We can now provide more details on Step 1, which consists of the following sub-steps.

Step 1a Select at random a function $f : U \to [1..n]$ from a universal class of hash functions.

Step 1b Compute a hash table $T'$ with chaining using the hash function $f$, so that each insertion takes constant time.

Step 1c Compute an array $B2$, so that $B2[i] = b_i^2$.

Step 1d If $\sum_{i=1}^n b_i^2 > \alpha n$ then go to Step 1a; otherwise record the function $f$.

Analysis. Step 1a takes constant time, Step 1b takes $O(n)$ time and $O(n)$ space, Step 1c takes $O(n)$ time using the table $T'$, and Step 1d takes $O(n)$ time using the array $B2$. The time complexity $T_1$ of Step 1 is therefore $O(tn)$, where $t$ is the number of iterations, i.e., the number of functions $f$ selected before the condition $\sum_{i=1}^n b_i^2 \le \alpha n$ is satisfied. The following claim shows that for $\alpha \ge 4$ we have $E(T_1) = O(n)$.

Claim: If $\alpha \ge 4$ then $E(t) \le 2$.

Proof.
Let $C_x$ be the number of collisions of a key $x \in S$ under $f$, i.e., the number of $y \in S$, $y \ne x$, for which $f(x) = f(y)$. Due to the collision property of universal hash functions (with $N = B = n$) we have $E(C_x) < 1$.

We consider the total number of collisions $C_S$ in $S$. Specifically, let $C_S$ be the number of (ordered) pairs $\langle x, y \rangle$, $x, y \in S$ and $x \ne y$, such that $f(x) = f(y)$. Clearly, $C_S = \sum_{x \in S} C_x$. Therefore, by linearity of expectation,

$$E(C_S) = \sum_{x \in S} E(C_x) < |S| \cdot 1 = n. \quad (1)$$

On the other hand, we note that collisions occur only among keys mapped into the same bucket, and can be counted as

$$C_S = \sum_{i=1}^n |\{\langle x, y \rangle : x, y \in B_i,\ x \ne y\}| = \sum_{i=1}^n b_i(b_i - 1) = \sum_{i=1}^n b_i^2 - \sum_{i=1}^n b_i.$$

Therefore, since $\sum_{i=1}^n b_i = n$,

$$\sum_{i=1}^n b_i^2 = C_S + n,$$

and by Eq. (1),

$$E\left(\sum_{i=1}^n b_i^2\right) = E(C_S) + n < 2n.$$

By Markov's inequality, applied to the random variable $X = \sum_{i=1}^n b_i^2$,

$$\Pr\left(\sum_{i=1}^n b_i^2 \ge 4n\right) \le 1/2.$$

If $\alpha \ge 4$, then for a function $f$ selected at random the condition $\sum_{i=1}^n b_i^2 \le \alpha n$ is satisfied with probability at least $1/2$. Therefore, the expected number $t$ of functions $f$ tried before the condition is satisfied is at most 2.

To compute Step 2, note that $p_1 = 0$ and $p_i = p_{i-1} + \beta b_{i-1}^2$ for $i > 1$. Therefore, $p_i$ can be computed and recorded in array $P$ by iterating for $i = 1, \ldots, n$. Step 2 takes $T_2 = O(n)$ time.

Finally, Step 3 consists of the following sub-steps, executed for all $i$, $i = 1, \ldots, n$:

Step 3a Initialize the subarray $T[P[i] + 1 \ldots P[i+1]]$ to nil.

Step 3b Select at random a function $g_i : U \to [1..\beta b_i^2]$ from a universal class of hash functions.

Step 3c For each $x \in B_i$: if $T[P[i] + g_i(x)]$ is not nil then go to Step 3a ($g_i$ is not injective on $B_i$ and a new $g_i$ is to be selected); otherwise write $x$ into $T[P[i] + g_i(x)]$.

Step 3d Record $g_i$ in $G[i]$.

Analysis. We first analyze Step 3 for a single bucket $B_i$. Step 3a takes $O(b_i^2)$ time. Step 3b takes constant time. Step 3c can be implemented in $O(b_i)$ time, using the $i$'th list in the hash table $T'$ computed in Step 1.
Step 3d takes constant time. The time complexity of Step 3 for bucket $B_i$ is therefore $t_i = O(\tau_i b_i^2)$, where $\tau_i$ is the number of iterations, i.e., the number of functions $g_i$ selected before an injective function is found for $B_i$.

Comment: We could have each iteration take only $O(b_i)$ time by removing Step 3a, initializing the table $T$ in Step 2 instead, and modifying Step 3c as follows:

Step 3c' For each $x \in B_i$: if $T[P[i] + g_i(x)]$ is not nil, then for all $y \in B_i$ assign nil to $T[P[i] + g_i(y)]$ and go to Step 3b; otherwise write $x$ into $T[P[i] + g_i(x)]$.

The following claim shows that for $\beta \ge 2$ we have $E(t_i) = O(b_i^2)$.

Claim: If $\beta \ge 2$ then $E(\tau_i) \le 2$.

Proof. Let $C_x$ be the number of collisions of a key $x \in B_i$ under $g_i$, i.e., the number of $y \in B_i$, $y \ne x$, for which $g_i(x) = g_i(y)$. Due to the collision property of universal hash functions (with $N = b_i$ and $B = \beta b_i^2$) we have

$$E(C_x) < b_i/(\beta b_i^2) = 1/(\beta b_i).$$

By Markov's inequality,

$$\Pr(C_x \ge 1) \le E(C_x) < 1/(\beta b_i). \quad (2)$$

Therefore, by Boole's inequality and Eq. (2), the probability that there are any collisions in $B_i$ is

$$\Pr(\exists x \in B_i \text{ such that } C_x \ge 1) \le b_i \cdot 1/(\beta b_i) = 1/\beta.$$

For $\beta \ge 2$, the function $g_i$ is injective with probability at least $1 - 1/\beta \ge 1/2$, and the expected number of trials $\tau_i$ required before an injective function is found is at most 2.

For $\beta \ge 2$ we therefore have

$$E(t_i) = O(E(\tau_i)\, b_i^2) = O(b_i^2).$$

The total time $T_3$ for Step 3 over all buckets is then

$$E(T_3) = \sum_{i=1}^n E(t_i) = O\left(\sum_{i=1}^n b_i^2\right) = O(n).$$

The running time $T$ of the entire algorithm can now be bounded as

$$E(T) = E(T_1 + T_2 + T_3) = E(T_1) + E(T_2) + E(T_3) = O(n).$$

Exercises

1. If $h$ is chosen at random from an almost-universal collection of hash functions and is used to hash $N$ keys into a table of size $B$, the collision probability of any two particular keys $x$ and $y$ is at most $2/B$, and the expected number of collisions involving a particular key $x$ is at most $2(N - 1)/B$.
Modify the algorithm above so that almost-universal functions are used instead of universal functions, and such that the expected running time remains $O(n)$.

2. Modify the algorithm above, and analyze it, so that the first-level function $f$ maps the input set $S$ into $2n$ buckets, instead of $n$ buckets.

3. (*) Generalizing (2), modify the algorithm above, and analyze it, so that the first-level function $f$ maps the input set $S$ into $\gamma n$ buckets, and select a $\gamma$ that gives favorable complexity (in terms of constants).

M. L. Fredman, J. Komlós, and E. Szemerédi. Storing a sparse table with $O(1)$ worst case access time. Journal of the ACM, 31(3):538-544, 1984.
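The whole construction (Steps 1-3 plus the query) can be sketched end-to-end in Python as follows. This is a minimal sketch under stated assumptions: the multiply-mod-prime family $h(x) = ((ax + b) \bmod p) \bmod m$ is one illustrative choice of universal class (technically only almost-universal, which suffices here, cf. Exercise 1), the prime and the toy key set are made up, indexing is 0-based, and $\alpha = 4$, $\beta = 2$ are taken from the analysis above:

```python
import random

PRIME = (1 << 31) - 1   # a Mersenne prime, larger than every key below

def rand_hash(m):
    """Random member of the family x -> ((a*x + b) % PRIME) % m."""
    a = random.randrange(1, PRIME)
    b = random.randrange(PRIME)
    return lambda x: ((a * x + b) % PRIME) % m

def build(S, alpha=4, beta=2):
    n = len(S)
    # Step 1: retry f until sum b_i^2 <= alpha*n (expected <= 2 tries).
    while True:
        f = rand_hash(n)
        buckets = [[] for _ in range(n)]
        for x in S:
            buckets[f(x)].append(x)          # Step 1b: chaining table T'
        if sum(len(B) ** 2 for B in buckets) <= alpha * n:
            break
    # Step 2: prefix-sum offsets; bucket i gets beta * b_i^2 cells of T.
    P = [0] * (n + 1)
    for i, B in enumerate(buckets):
        P[i + 1] = P[i] + beta * len(B) ** 2
    T = [None] * P[n]
    # Step 3: per bucket, retry g_i until injective, then place the keys.
    G = [None] * n
    for i, B in enumerate(buckets):
        m = max(1, beta * len(B) ** 2)
        while True:                          # expected <= 2 tries (beta >= 2)
            g = rand_hash(m)
            if len({g(x) for x in B}) == len(B):   # injective on B_i?
                break
        G[i] = g
        for x in B:
            T[P[i] + g(x)] = x
    return f, G, P, T

def query(x, f, G, P, T):
    """Constant-time membership query: two hashes and one probe."""
    i = f(x)
    pos = P[i] + G[i](x)
    return pos < P[i + 1] and T[pos] == x    # guard handles empty buckets
```

The bound `len(T) <= beta * alpha * n` mirrors the space analysis: the flat table has $\beta \sum b_i^2 \le \alpha\beta n$ cells, so the structure is linear-size even though each bucket's subarray is quadratic in its own size.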