Download Linear Hashing Is Awesome - IEEE Symposium on Foundations of

2016 IEEE 57th Annual Symposium on Foundations of Computer Science Linear Hashing is Awesome Mathias Bæk Tejs Knudsen Department of Computer Science University of Copenhagen Copenhagen, Denmark Email: [email protected] a very fundamental class of hash functions is h : rps Ñ rms (where rms t0, 1, . . . , m 1u) defined by hpxq ppax bq mod pq mod m, where a, b P rps are chosen uniformly at random. Here p is a prime and p ¥ m. Almost all students of computer science will come across it at least once in their curriculum, and it is e.g. the only randomized hash function mentioned in CLRS [1]. This makes the study of this particular hash function especially interesting. The study of hpxq boils down to the following two properties: (1) The collision probability is low, that is Prphpxq 1 hpy qq O m for x y. This follows from the 2independence of h. (2) hpxq is almost affine, in the following sense: hpxq hpy q hpx y q hp0q can only attain a constant number of different values. In 1979 Property (1) led Carter and Wegman [2] to introduce the concept of k-independence, which has since been a prominent part of the studies of hash functions. This was done by replacing the degree one polynomial in the definition of h with a random degree k 1 polynomial. From Property (1) one can obtain a number of results when using h as the hash function: (i) When inserting n keys into a table of size n using hashing with chaining the expected time used to insert any key is Op1q. (ii) When inserting n keys into a table of size n using hashing with chaining the expected size of the longest ? chain is Op nq. This corresponds to the expected worst case performance over the insertion of all n keys. (iii) Letting X tx1 , . . . , xn u be a set of n keys, hpx1 q is the minimum value of all hpxi q, xi P X with hash probability O ?1n . (iv) In a hash table implemented by linear probing using a table of size n and containing p1 ? εqn keys, the worst-case expected query time is Op nq. Property (2) is used by Baran et al. [3] to solve the 3SUM problem faster than Θpn2 q, but Property (1) seems to have more far reaching consequences than Property (2). Result (i) is tight as no matter which hash function is used the expected time to insert any element must be Ωp1q. However, it is not clear whether Results (ii), (iii) and (iv) Abstract—The most classic textbook hash function, e.g. taught in CLRS [MIT Press ’09], is hpxq ppax bq mod pq mod m , pq where x, a, b P t0, 1, . . . , p 1u and a, b are chosen uniformly at random. It is known that pq is 2-independent and almost uniform provided p is a prime and p " m. This implies that when using pq to build a hash table with chaining that contains n ¤ m keys, the expected query time ? is O p1q and the expected length of the longest chain is Op nq. This result holds for any 2-independent hash function. No hash function can improve on the expected query time, but the upper bound on the expected length of the longest chain is not known to be tight for pq. Partially addressing this problem, Alon et al. [STOC ’97] proved the existence of a class of linear hash functions such that the expected length of the longest chain is ? Ωp nq and leave as an open problem to decide which nontrivial properties pq has. We make the first progress on this fundamental problem, by showing that the expected length of the longest chain is at most n1{3 op1q which means that the performance of pq is similar to that of a 3-independent hash function for which we can prove an upper bound of O n1{3 . As a lemma we show that within a fixed set of integers there are few pairs such that the height of the ratio of the pairs are small. Given two non-zero coprime integers n, m P Z with n the height of m is max t|n| , |m|u, and the height is a way of measuring how complex a fraction is. This is proved using a mixture of techniques from additive combinatorics and number theory, and we believe that the result might be of independent interest. For a natural variation of pq, we show that it is possible to apply second order moment bounds even when a hash value is fixed. As a consequence: For min-wise hashing it was known that any key from a setof nkeys has the smallest hash value with probability O ?1n . We improve this to n1 op1q . For linear probing it was known that the worst case ? expected query time is O p nq. We improve this to nop1q . Keywords-hashing, linear hashing, hashing with chaining, additive combinatorics. I. I NTRODUCTION Hash functions are widely used and well studied within theoretical computer science. There are a vast number of different hash functions in the literature ranging from simple to complex, and some hash functions are constructed to perform well when used for specific purposes. In contrast 0272-5428/16 $31.00 © 2016 IEEE DOI 10.1109/FOCS.2016.45 344 345 hpxq hp2xq then with probability 12 it also holds that hp2xq hp3xq. Whenever we are interested in the behaviour of all keys together the it might not be important whether the collision probability of a triple is small for every triple or for most triples of keys. As a direct corollary of Theorem 1 we prove ? the upper bound in Result (2) can be improved from Op nq to n1{3 op1q : are tight. There are no results showing that the bounds in Results (ii)–(iv) are tight for h. Instead it is shown in [4], [5] that there exist 2-independent hash functions for which the upper bounds in Results (ii)–(iv) are tight. This shows that we either have to use Property (2) or find a new property in order to improve these bounds. Furthermore, Alon et al. [6] have shown that there exist linear hash functions meeting the bounds in Result (ii). These linear hash functions can easily be altered to be affine hash functions that satisfy Property (1) as well showing that even a combination of Property (1) and (2) will not suffice in order to improve on Result (ii). However in [6] it is also shown that there exists linear hash function (over the field of 2 elements) for which the maximum load is Oplog n log log nq. The discussion above leads to one of two conclusions. Either Property (1) and (2) captures the essence of h and we need to prove that Results (ii)–(iv) cannot be improved. Otherwise Results (ii)–(iv) can be improved and we need to find a new property of h that must be entirely different of Property (1) and (2). Corollary 1. Let U rus be a universe of keys and X U a set of n keys such that p ¡ un. Assume that m ¥ n and let M be the number of keys that popular hash to the most ( hash value, i.e. M maxvPrms x P X | hpxq v . Then E pM q ¤ n1{3 op1q . The same result holds when h is replaced with h̃ in the definition of M . In order to improve on Results (iii) and (iv) we prove the following result, that says that even if we fix one hash value, we can still use a certain type of second order moment bounds. Theorem 2. Let U rus be a universe of keys and X U a set of n keys such that p ¡ un. Let v P rms, I rms be a set ° of consecutive1 integers mod m and x0 P U zX. Let A xPX rh̃pxq P I s and μ E pAq. Here the expression rh̃pxq P I s is the Iverson bracket, which denotes a a number that is 1 if h̃pxq P I and 0 otherwise. Then for every δ ¡ 0 it holds that: ? P |A μ| ¥ δ μ | h̃px0 q v 2 n 1 op1q ¤ . n O δ2 pμ A. Our Results In the following let p be a prime and m an integer not larger than p. Let a, b P rps be independent and uniformly random integers. Let h : rps Ñ rps be defined by hpxq pYax b]q mod p. We let hpxq hpxq mod m and h̃pxq hpxqm . We can think of h and h̃ as extracting the least p and the most significant bits of h respectively. The main result of this paper is the proof of the following property of h and h̃: Using Theorem 2 we improve on Results (iii) and (iv). Theorem 1. Let U rus be a universe of keys and X U a set of n keys such that p ¡ un and m ¥ n. Let x, y, z P X be three keys chosen from X independently and uniformly at random, then: P hpxq hpy q hpz q O m2 n2 op1q , Corollary 2. Let U rus be a universe of keys and X a set of n keys such that p ¡ un. Let x0 P X. Then: ! ) 1 o p1 q . h̃pxq ¤n P h̃px0 q min U xPX ztx0 u Corollary 3. Let U rus be a universe of keys and X U a set of n keys such that p ¡ un. Assume that m ¥ p1 εqn for some constant ε ¡ 0, and say that all of the keys from X are inserted into a table of size m using h̃ with linear probing. Then the expected query time for any element u P U is nop1q . and the same result holds when h is replaced with h̃. Comparing Theorem 1 with the performance of a uniformly random hash function we see that if h had been uniformly random then we could have gotten the bound Opm2 n2 q instead, which follows from the fact that for any 3 distinct keys the probability that they have same hash value is exactly m2 . In fact this holds for any 3independent hash function. Except for the multiplicative term nop1q the main difference is that for a 3-independent hash function it holds that for every triple of distinct keys px, y, z q the probability that they all have the same hash value is small. Theorem 1 instead states that this property holds for most such triples. Considering the three keys px, 2x, 3xq it becomes apparent that it is not possible to prove that it holds for every triple as if we condition on B. Technical contribution The main technical contribution is an upper bound on the number of pairs within a set where the ratio of the pairs have small height. For a non-zero rational number x P Q the height H pxq of x is the smallest number such that there exists integers n, m P Z with |n| , |m| ¤ H pxq and n . Equivalently, if x m x m n for co-prime integers n, m 1 Meaning that for suitable values of r, s P N we have I 1q mod m, . . . , pr s 1q mod mu . tr mod m, pr 346 345 then H pxq max t|n| , |m|u, see e.g. [7]. We note that that H pxq H x1 . The main technical contribution is captured in the following theorem: A non-zero rational number x P Q can be written as x where m, n P Z are coprime. The height of x is defined as H pxq max t|m| , |n|u. For a positive integer n we let d2 pnq be the number of divisors of n. In general, for a positive integer r we let dr pnq be the number of ways n can be written as a product of a sequence of r positive integers, i.e.: m n Theorem 3. Let A Qz t0u be a set of n non-zero rational numbers and let k be an integer. The number a that H ¤ k is at most of pairs?pa, a1 q P A2 such a O nk 2 1 plog nqplog log 3kq . dr pnq |tpa1 , a2 , . . . , ar q P Nr | a1 a2 . . . ar The theorem is proved using a combination of tools from additive combinatorics and number theory. The use of this theorem comes from the fact that we can essentially quantify “how correlated” hpxq, hpy q, hpz q are by looking at the x . If the fraction has small height, then the hash fraction yz x values are very correlated and vice versa.2 epxx1 , ξ q epx, ξ qepx1 , ξ q, epx, ξξ 1 q epx, ξ 1 qepx, ξ 1 q, for all x, x1 , ξ, ξ 1 P G and such that for any x P G other than the identity there exists ξ P G such that epx, ξ q 1, and for any ξ P G other than the identity there exists x P G such that epx, ξ q 1. It is well-known [8] that such a bi-character exists for any finite abelian group. For a fixed group G and bi-character e we have the following definitions from Fourier analysis following the exposition from [8]. The Fourier transform fppξ q of a function f pxq on G is defined by: The structure of the paper is as follows. In Section II we introduce the necessary notation. In Section III we prove Theorem 3, and in Section IV we show how to use this result to prove new properties for h and h̃. Finally, in Section V we show that these properties imply a number of interesting results. Z denotes the integers, N the positive integers, Q the rational numbers, and Q the positive rational numbers. Zn Z{nZ denotes the integers modn. Zn is the set of elements of Zn having a multiplicative inverse; pZn , q is an abelian group. rns is shorthand for t0, 1, 2, . . . , n 1u. For a pair of integers n, m P Z such that pn, mq p0, 0q we let gcdpn, mq denote the greatest common divisor of n and m. If gcdpn, mq 1 then n and m are coprime. We write a | b to mean that a divides b and a b if a does not divide b. We use the Iverson bracket notation meaning that for a condition P we let rP s denote 1 if P is satisfied and 0 otherwise. log denotes the natural logarithm. Throughout the paper p will be a prime and m a positive integer smaller than p. We let h : rps Ñ rps be a random hash function defined by hpxq pax bq mod p where a, b P rps are chosen independently and uniformly at random. We will often consider h to be a mapping from Zp to Zp in the obvious way. We define the hash functions h, h̃ : rps Ñ rms by: h̃pxq 1 ¸ f pxqepx, ξ q . |G| x PG fppξ q II. P RELIMINARIES hpxq hpxq mod p, nu| . Let G be a finite abelian group. A bi-character e : G G Ñ Cz t0u is a function that satisfies C. Structure of the Paper Z With this definition we have the Fourier inversion formula: ¸ f pxq P fppξ qepx, ξ q . ξ G For functions f pxq and g pxq on G we define the convolution pf g qpxq to be: pf g qpxq We note that p{ f g qpξ q is: ¸ P ¸ 1 f paqg pbq . |G| a,bPG,abx fppξ qg ppξ q. 2 |f pxq| |G| x G Plancherel’s formula 2 ¸ fppξ q . P ξ G III. H EIGHTS The main theorem we prove in this section is: Theorem 3. Let A Qz t0u be a set of n non-zero rational numbers and let k be an integer. The number a of pairs?pa, a1 q P A2 such that H ¤ k is at most a ^ hpxq m . p O nk 2 For an integer x P Z we let rxsp P Zp denote its residue class mod p. We let ι : Zp Ñ rps be the unique mapping that satisfies rιpxqsp x. plog nqplog log 3kq . 1 We prove Theorem 3 by counting the number of solutions to ab a1 b1 where a, a1 P A and b, b1 are small numbers. An upper bound on this number is given in Lemma 1. 2 This is not entirely true as we need to introduce a concept of height mod p in order to justify this claim, but for the sake of the simplicity of the introduction this concept is not defined before needed. Lemma 1. Let A N be a set of n positive integers and B t1, 2, . . . , k u. Let t be the number of solutions to ab 347 346 a1 b1 where a, a1 P A, b, b1 P B. Then for any positive integer r: k ¸ 1 {r r t¤n 1 1 {r where we use Plancherel’s formula. By Hölder’s inequality we have: pdr piqq 2 | . ¤ S1 p q n1 2 pp A . { k pOplog 3kqqr . B qp qq 2 | x G | G| 3 P | Ap q| 2 | Bp G| 3 ¸ χ{χ P ξ G q| 2 , p A B qpξ q Bp q| (2) x ¸ χx ξ {p q 2{pr 1q |G| | A| 2 r 1 | P Ap q| 2 A p | | ξ G | r 1 q{pr1q . (3) G| B B qp | | | 2 Bp q| 2r q Bp q| 2r | B Bp q| 2 ξ G G| xPG |p B B qp q| 2 Since q ¡ k r for any positive integer n less than q we have r 1 that |G| pχB . . . χB qpnq is equal to the number of ways n can be written as a product of r integers where none is larger than k. This is equal to 0 for any n ¡ k r and for any n ¤ k r it is bounded by dr pnq. Combining this with (4) gives: 2 1 r 1 pχB . . . χB qpxq | G| S2 2r 1 |G| x PG ¸ 1 r ¸ χ χ x ¸P χx ξ χx ξ ¤ n Letting r loglog log 3k gives the desired. Now we turn to proving Lemma 1. The idea is to count the number of solutions using Fourier analysis. Subsequently, we give an upper bound on the corresponding sum using Hlder’s inequality. The result now comes from using the Fourier inversion formula on the upper bound. Proof of Lemma 1: Let r be a positive integer. Let q be a prime greater than maxaPA tau k r , and let G Zq . We note that A and B can be considered to be subsets of G and we will do so. Let χA , χB : G Ñ t0, 1u be the characteristic functions of A and B, respectively. For any positive integer n less than q we identify n with its residue mod q, i.e. we consider it to be an element of G. We note that |G| pχA χB qpnq is the number of pairs pa, bq P A B such that ab n since ab q for each a P A, b P B, and hence t is equal to t | G| | P ξ G ξ G By combining this with Lemma 1 we get: ¤ q| 2r 2 2 r, x k r , s 2 in Theorem 4 we get: r r Ap 1 r 2r r 1 2 n kr b 4t 2 will now bound S . We note that χx ξ We{ χ . . . χ ξ where there are r terms in the convolution. Hence we have ¸ χx ξ ¸ χ { S ... χ ξ P P 1 ¸ χ ... χ x . (4) s ¤ q| x plog x z s logp3z s qqz 1 Opzs {plogp3zs qqq e . pz s logp3z s qeγ 1 qzs where γ 0.577 . . . is Euler’s constant and the constant in the O-notation depends on ε. k pO plog 3k qq Bp where S1 , S2 are defined in the obvious way. We will bound S1 and S2 in turn starting with S1 . Note that by the definition of χA we have |χA pξ q| ¤ |A| for all ξ P G. This implies together with Plancherel’s |G| formula that x p | 1 Theorem 4 ([9]). Let ε ¡ 0 be a constant and ° let x ¥ 1, s ¥ 1 ε, and let z ¥ 2 be an integer. Then n¤x pdz pnqqs is bounded by: dr pnqq2 | P pr1q{r S 1{r , S 1 p 2 ξ G 1 q| r 1 r Before we prove Lemma 1 we will show how Theorem 3 follows from it. Proof of Theorem 3: First we note that by multiplying all the numbers of A with a sufficiently large integer we can wlog assume that A Zz t0u. Let A1 t|a| | a P Au. We will use Lemma 1 with A1 and B t1, 2, . . . , k u. We note that for every pair pa, a1 q P A such that H aa ¤ k there exist b, b1 P B such that |a| b |a1 | b1 . Furthermore each such equation can correspond to at most 4 pairs pa, a1 q P A where H aa ¤ k. So 4t where t is from Lemma 1 is an upper bound on the number of pairs pa, a1 q P A where the height of the ratio is ¤ k. We will need the following theorem proved in [9]. ¸ Ap ξ G i1 Letting ε 1, z ¸ χx ξ χx ξ P ¸ p q{ ¸ { {p q χx ξ χx ξ ¤ | G| ¸d kr 1 2r 1 p r pxqq 2 . (5) x 1 Combing (1), (2), (3), and (5) gives 3 t ¤ | G| | A| A p | { q{ r 1 r ¸ | 1 1 r | G| G| 1 r dr pnqq p 2 ¸d p { 1 r kr 2r 1 { | kr 1 r pnqq 2 n 1 , n 1 as desired. In the reductions in Section IV we will be interested in the sum of the reciprocals of the heights studied in Theorem 3. Corollary 4 gives an upper bound on this quantity. 2 Corollary 4. Let A Q be a set of n positive rational numbers. Then: ? 1 O p plog nqplog log 3nqq ¤ n2 . a H a a,a PA ¸ (1) ξ G 1 348 347 1 Proof: First we prove (9). Fix integers a, b P Z that sat1 1 1 isfy |a| , |b| ¤ Hp rxsp ry sp and rxsp ry sp rasp rbsp . Then p | xb ay. Either |xb ay | p or |xb ay | ¥ p. If |xb ay | p then since p | xb ay we have xb ay 1 x a ¥ H xy , and y b . This implies that Hp rxsp ry sp and so (9) holds. Otherwise |xb ay | ¥ p. By the triangle 1 inequality p ¤ |xb ay | ¤ Hp rxsp ry sp p|x| |y |q and therefore (9) also holds in this case. X? \ Now! we turn to proving (10). p and ) Let z 1 S r0sp x, r1sp x, . . . , rz sp x , then S contains z distinct elements. Write S as S ts0 , s1 , . . . , sz u where the elements of S are ordered such °z that ιps0 q . . . ιpsz q. Let sz 1 s0 , then we have i0 ιpsi 1 q ιpsi q p. Hence ? there must exist an index i such that si 1 si ¤ z p 1 p. By the definition of S we have that si 1 si tx for ?p. So we can write some t P Zp with |t| ¤ z ? 1 x tpsi 1 si q with |t| , |si 1 si | p and hence ? Hp pxq p which proves (10). We will need the following technical lemma. Lemma 3 uses the fact that that the numbers rism x1 for i 0, 1, . . . , ιpxq 1 are essentially evenly spaced in Zm for any x P Zm . This implies that given an interval I only a I| fraction of about |m of these numbers can be contained in I. Proof: For k a positive integer let Vk be the subset of A2 containing pairs where the height of the ratio is a most k, i.e.: ) ! a Vk pa, a1 q P A | H 1 ¤ k . a For convenience let V0 H. Then we know that: ¸ 1 a,a1 PA H ¸ 1 8 a a1 k 1 ¸ 8 Firstly we note that: ¸ 8 k n 1 ¤ |V k | ¸ 8 k n 1 n2 1 k k | V k zV k 1 | |V k | k11 1 k . (6) k 1 1 nn 1 n . (7) k 1 k11 1 k 2 For any ? k ¤ n we see by Theorem 3 that |Vk | O p plog nqplog log 3nqq nk2 . Hence: ņ k ņ |Vk | k 1 ¤ |Vkk2 | 1 ? log n log log 3n k 1 O ¤ n2 p p 1 k 1 ¤ qp qq . Lemma 3. Let m be?a positive integer and et x, y P Zm be elements satisfying m ¥ |x| ¥ |y | and gcdp|x| , |y |q 1 and let I Zm be an interval. The number of elements ||x| r P r|x|s such that rrsm yx1 P I is at most |Im 2. (8) Combining (6), (7), and (8) yields the desired conclusion. Proof: If ιpxq |x| we can replace x and y by x and y, so wlog assume that ιpxq |x|. ( 1 !QLet A U rism x ) | i P t0, 1, . . . , ιpxq 1u and B im | i P r|x|s . First we prove that A B. Since x m IV. R EDUCTION The goal of this section is to prove Theorem 1. ! For x P Zp , r P N we)let I px, rq denote the set x, x r1sp , . . . , x rr 1sp . A subset A Zp is called an interval if it is on the form I px, rq. A subset B Zp is called a generalized interval if B xA for some interval A and x P Zp . For x P Zp we let |x| be defined as | |x| min tιpxq, ιpxqu For x P Zp we let Hp pxq be the restricted height defined m hence ιpiqx1 P B, so we conclude that A B. Say that there are k ¥ 2 elements r P r|x|s such that ryx1 P I. (If k ¤ 1 there is nothing to prove.) If |y | ιpy q we let I 1 I t0, 1, . . . , |y | 1u and otherwise we let I 1 I t0, 1, . . . , |y | 1u. We note that |I 1 | |I | |y |. For each r P r|x|s that satisfies rrsm yx1 P I there is an element a P A such that a P I 1 . Since I 1 is an interval there exists r P Zm , s P Z such that I 1 tr, r r1sm , . . . , r rs 1sm u. Since I 1 contains k elements from B the interval rr, r sq contains k elements on the form |ip x| , i P Z, and hence s ¥ pk 1q |xp| . Using that s |I 1 | |I | |y| we get that: by: Hp pxq min max t|a| , |b|u | a, b P Zp , x ab1 ( We note that Hp pxq Hp x1 . In Lemma 2 it is proved ? that Hp pxq ¤ p. The following lemma also shows that if x, y is much smaller than p, then if the height of xy is large then the restricted height is large as well. Lemma 2. Let x, y Hp P Z be integers not divisible by p, then: " rxsp rysp ¥ min |x| |y| , H ? Hp pxq p . 1 p * x y , | A and B are finite sets of the same cardinality it suffices to prove that A B. Let i P t0, 1, . . . , ιpxq 1u and j P r|x|s be the unique integer satisfying |x| | mj Q U ιpiq. mj ιpiq mj ιpiq 1 Then ιpiqx and |x| mj and |x| |x| (9) k (10) 349 348 ¤ I |mx| 1 1 |I | |x| |x| |y| m m 1¤ |I | |x| m 2. By Lemma 3 we have that |Sy X I | is bounded by |Ip|t so we get Lemma 4 below show how we can connect the concept of heights to linear hash functions. Given elements x, y P Zp it gives an upper bound on the probability that pax, ay q P I 2 , when a P Zp is chosen uniformly at random. If the random variables ax and ay had been independent (they are clearly 2 not!) then the probability would have been exactly p|I | {pq . 1 Lemma 4 shows that when the restricted height of xy is 2 large the probability is close to p|I | {pq . Lemma 4. Let x, y P Z and I Z be a generalized interval. Then: tP0 ¤ P ppax, ay q P I 2 q ¤ pP pax P I qq 2 O P pax P I q Hp pxy 1 q 1 p , tP0 ιpxqP0 E ¸ P ph̃py q, h̃pz qq P I 2 | h̃pxq v ¤ sPS psx, sy q P I (11) 2 |I | m n 1 o p1 q |I | m O p p 1 q . (12) If I consists of a single element the statement also holds if we replace h̃ with h. Proof: Fix y, z P X z txu. Later in the proof we will unfix y, z and consider them to be random variables. Let J h̃1 pI q, then J is an interval. (If I is a singleton, then 1 h pI q is a generalized interval, showing that the statement also holds if we replace h̃ with h in this case.) Let u P h̃1 pv q. We will show that (12) holds when we condition on hpxq u instead of h̃pxq v. Since u is arbitrarily chosen this will be sufficient. If hpxq u and ph̃py q, h̃pz qq P I 2 then phpy q hpxq, hpz q hpxqq P pJ uq2 , i.e. papy xq, apz xqq P pJ uq2 . Since h is 2-independent hpxq and hpy q hpxq apy xq are independent, showing that hpxq and a are independent. So we conclude that: ) 2 Theorem 5. Let U rus be a universe of keys and X U a set of n keys such that p ¡ un. Let x P X, v P rms be a hash value and let I rms be an interval when considered as a subset of Zm . Let y, z P X z txu be chosen independently and uniformly at random, then: P ph̃py q, h̃pz qq P I 2 | hpxq u Let S a, r1sp x1 , . . . , rιpxq 1sp x1 . By linearity of expectation we have: which finishes the proof as P pax P I q |pI | . Instead of proving Theorem 1 we will prove a slightly more general theorem from which Theorem 1 follows. The idea of the proof is the following (assuming that I is an interval for simplicity). By multiplying x, y with an appropriate factor we can wlog assume |y | ¤ ιpxq ¤ ? Hp pxy 1 q p. Instead of directly calculating the probability that pax, ay q P I 2 we) consider the set S ! a, r1sp x1 , . . . , rιpxq 1sp x1 and upper bound the expected number of elements s P S satisfy psx, sy q P I 2 . By linearity of expectation this will allow us to upper bound the desired probability. This turns out to be easier for the following reason. The set Sx is an interval and hence, for most values of a, it is either contained in I or disjoint from I. When it is disjoint from I there are no elements s P S such that psx, sy q P I 2 . When Sx is contained in I we just need to calculate |Sy X I |, which we do by using Lemma 3. The details are given below. Proof: Firstly, we see that I zJ for some interval J and z P Zp . Since pax, ay q P I I iff paxz, ayz q P J J we can replace x, y, I with xz, yz, J respectively and wlog assume that I is an interval in the following. For any P Zp the probability that pax, ay q P I I does not change if we exchange x and y with x and y respectively. Let Hp pxy 1 q t, then there exists such that |x| , |y| ¤ t. Since we may also swap x and y we can ? wlog assume that |y | ¤ ιpxq t p. We let T be the set of elements s P Zp such that psx, sy q P I 2 , i.e. T x1 I X y 1 I. Let a P Zp be chosen uniformly at random and let P0 be the probability that pax, ay q P I 2 , ! 2 P pSx X I Hq . |I | t |I | 2 p t p p 2 |I | |I | 2 , 3 p pt p P0 ¤ where a P Zp is chosen uniformly at random. P0 P ppax, ay q P I 2 q . |I | t p Since Sx is an interval consisting of t elements, we have that the probability that Sx X I H less than |I |p t . Inserting this gives p p 2, P papy xq, apz xqq P pJ uq2 | hpxq u P papy xq, apz xqq P pJ uq2 . (13) , By Lemma 4 we get: P papy xq, apz xqq P pJ uq2 Clearly the number of elements s P S such that psx, sy q P I 2 is bounded by |Sx X I | and |Sy P XI |. Inserting this gives: ¤ tP0 ¤ E pmin t|Sx X I | , |Sy X I |uq . 350 349 |J u| p 2 O Hp 1yx |J u| z x p 1 . p (14) I| Since |J u| p ¤ |m 1 (14) to conclude that: P p ¤ 1 p h̃py q, h̃pz qq P I 2 | h̃pxq v I m | | 2 1 O Hp moment bounds. If h̃ had been 3-independent the upper bound in (20) could have been replaced with O δ 2 . In 2 n term, most cases the nop1q term in (20) will dominate the pμ and we will be able to prove bounds that are only worse by a factor of nop1q than the corresponding bounds for 3independent hash functions. we can combine (13) and y x z x I m 1 . p | | (15) Theorem 6. Let U rus be a universe of keys and X U a set of n keys such that p ¡ un. Let°v P rms, I rms be an interval and x0 P U zX. Let A xPX rh̃pxq P I s. Then μ E pAq satisfies Now we unfix y, z and consider y, z P X z txu to be chosen uniformly at random. By averaging over (15) we get: P I m | | p h̃py q, h̃pz qq P I 2 | h̃pxq v ¸ 1 O pn 1q2 2 y 1 ,z 1 P X ztxu Hp I | | y 1 x z 1 x ¤ 1 . (16) p m μ E pA | h̃px0 q v q Θ n and for every δ P By Lemma 2 we see that: Hp 1 y 1 x z 1 x ! ¤ 1 min Ωpnq, H ¤ O 1 n H y 1 x z 1 x 1 y 1 x z 1 x ) . n O p m 2 n2 o p1 q q n 2 o p1q 2 Letting σ 2 | J| p op1q n O n2 pμ , (19) . (20) 2 n1 o p1 q | J| p O n 2 p 1 . (21) V pA | h̃px0 q v q, we get: σ2 E A2 | h̃px0 q v μ nop1q O n2 pμ μ2 . By an application of Chebyshev’s Inequaliy we now get Equation (20). Consider linear probing. In [4] it is proved that when using a 3-independent hash function the worst case expected query time for any element is O plog nq. This is done by fixing the hash value of the element we query for, and then using second order moment bounds. By using Theorem 6 instead of the usual second order moment bound for 3independent hash functions we get an upper bound that is nop1q Oplog nq nop1q instead. This proves Corollary 3. Similarly, we can prove that h̃ performs almost as well as 3-independent hash functions when using it for minwise hashing. This is done in Corollary 2 below. We should 1 o p1 q compare to the upper bound of the upper bound of n log n O n that holds for 3-independent hash functions [5]. Proof: Fix h. Choosing x, y, z P X randomly the probability 3 that all of the keys are in the same largest bucket is M . But the average probability of this over all choices n of h is Opm2 n2 op1q q. Hence: Corollary 1. Let U rus be a universe of keys and X U a set of n keys such that p ¡ un. Assume that m ¥ n and let M be the number of keys that popular hash to the most ( hash value, i.e. M maxvPrms x P X | hpxq v . Then E pM q ¤ n1{3 op1q . The same result holds when h is replaced with h̃ in the definition of M . E ? A μ| ¥ δ μ | h̃px0 q v E A2 | h̃px0 q v In this section we apply the results from Section IV to show performance guarantees when using h and h̃ for hash tables with chaining, for min-wise hashing and for linear probing. We first prove that Corollary 1 is a corollary of Theorem 1. The idea in the proof is that if there exists a large bucket, then there must be a high probability that three random keys have the same hash values. 3 | 0 it holds that: Proof: (19) follows from the 2-independence of h̃. By Theorem 5: (17) V. A PPLICATIONS M n 1 ¤ δ2 We now see that Corollary 4 applied to (17) in combination with (16) gives the desired upper bound. ¡ I m | | . (18) Corollary 2. Let U rus be a universe of keys and X a set of n keys such that p ¡ un. Let x0 P X. Then: Applying Jensen’s inequality to the convex function x Ñ x3 now gives E pM q n1{3 op1q as desired. The following restatement of Theorem 2 shows that even if we fix one hash value we can still use second order P 351 350 h̃px0 q min xPX ztx0 u ! h̃pxq ) ¤ n 1 o p1 q . U Proof: Let X 1 X z tx0 u. For v P rms let Iv and let Av be the random variable defined by: Av 1s ¸ rh̃pxq ¤ vs ¸ rh̃pxq P I s . ! ) [7] E. Bombieri and W. Gubler, Heights in Diophantine geometry. Cambridge University Press, 2007, vol. 4. [8] T. Tao, “Lecture notes 2 for 254a.” [Online]. Available: https://www.math.cmu.edu/ af1p/Teaching/AdditiveCombinatorics/Tao.pdf v x PX 1 x PX 1 Now we note that h̃px0 q minxPX v rv h̃pxq iff there exists 1 [9] K. K. Norton, “Upper bounds for sums of powers of divisor functions,” Journal of Number Theory, vol. 40, no. 1, pp. 60– 85, 1992. P rms such that h̃px0 q v and Av 0. So we get: P ! ) h̃px0 q min h̃pxq ¸ P A 0 | h̃px q v P h̃px q v . x PX 1 v 0 0 v Prms P h̃px q v O m aSince E pA q gives: ! ) P h̃px q min h̃pxq ¸ " n ¤O 1 min 1, 0 1 Theorem 6 with δ v 0 x PX 1 1 m n 1 o p1 q v Prms O ?1p n v 1 o p1 q m 1 o p1 q O m2 ppv 1q2 * . ACKNOWLEDGMENT The author would like to thank the anonymous referees for the most thorough review he has ever received. The author also thanks Rikke Langhede for introducing him to the concept of heights, and Mikkel Thorup for helpful comments. R EFERENCES [1] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms (3. ed.). MIT Press, 2009. [2] L. Carter and M. N. Wegman, “Universal classes of hash functions,” J. Comput. Syst. Sci., vol. 18, no. 2, pp. 143–154, 1979. [3] I. Baran, E. D. Demaine, and M. Pǎtraşcu, “Subquadratic algorithms for 3SUM,” Algorithmica, vol. 50, no. 4, pp. 584– 596, 2008. [4] M. Pǎtraşcu and M. Thorup, “On the k-independence required by linear probing and minwise independence,” ACM Trans. Algorithms, vol. 12, no. 1, p. 8, 2016. [5] M. B. T. Knudsen and M. Stöckel, “Quicksort, largest bucket, and min-wise hashing with limited independence,” in Algorithms - ESA 2015 - 23rd Annual European Symposium, Patras, Greece, September 14-16, 2015, Proceedings, 2015, pp. 828– 839. [6] N. Alon, M. Dietzfelbinger, P. B. Miltersen, E. Petrank, and G. Tardos, “Linear hash functions,” Journal of the ACM (JACM), vol. 46, no. 5, pp. 667–683, 1999. 352 351

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Linear Hashing Is Awesome - IEEE Symposium on Foundations of