Download Linear Hashing Is Awesome - IEEE Symposium on Foundations of

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Mathematical proof wikipedia , lookup

List of important publications in mathematics wikipedia , lookup

Georg Cantor's first set theory article wikipedia , lookup

Vincent's theorem wikipedia , lookup

Nyquist–Shannon sampling theorem wikipedia , lookup

Four color theorem wikipedia , lookup

Laws of Form wikipedia , lookup

Fermat's Last Theorem wikipedia , lookup

Infinite monkey theorem wikipedia , lookup

Wiles's proof of Fermat's Last Theorem wikipedia , lookup

Birthday problem wikipedia , lookup

Brouwer fixed-point theorem wikipedia , lookup

Theorem wikipedia , lookup

Karhunen–Loève theorem wikipedia , lookup

Law of large numbers wikipedia , lookup

Central limit theorem wikipedia , lookup

Non-standard calculus wikipedia , lookup

Fundamental theorem of calculus wikipedia , lookup

Fundamental theorem of algebra wikipedia , lookup

Proofs of Fermat's little theorem wikipedia , lookup

Transcript
2016 IEEE 57th Annual Symposium on Foundations of Computer Science
Linear Hashing is Awesome
Mathias Bæk Tejs Knudsen
Department of Computer Science
University of Copenhagen
Copenhagen, Denmark
Email: [email protected]
a very fundamental class of hash functions is h : rps Ñ rms
(where rms t0, 1, . . . , m 1u) defined by hpxq ppax
bq mod pq mod m, where a, b P rps are chosen uniformly at
random. Here p is a prime and p ¥ m. Almost all students
of computer science will come across it at least once in
their curriculum, and it is e.g. the only randomized hash
function mentioned in CLRS [1]. This makes the study of
this particular hash function especially interesting.
The study of hpxq boils down to the following two
properties:
(1) The collision probability
is low, that is Prphpxq 1
hpy qq O m
for x y. This follows from the 2independence of h.
(2) hpxq is almost affine, in the following sense: hpxq
hpy q hpx y q hp0q can only attain a constant number
of different values.
In 1979 Property (1) led Carter and Wegman [2] to introduce
the concept of k-independence, which has since been a
prominent part of the studies of hash functions. This was
done by replacing the degree one polynomial in the definition of h with a random degree k 1 polynomial. From
Property (1) one can obtain a number of results when using
h as the hash function:
(i) When inserting n keys into a table of size n using
hashing with chaining the expected time used to insert
any key is Op1q.
(ii) When inserting n keys into a table of size n using
hashing with
chaining the expected size of the longest
?
chain is Op nq. This corresponds to the expected worst
case performance over the insertion of all n keys.
(iii) Letting X tx1 , . . . , xn u be a set of n keys, hpx1 q
is the minimum
value of all hpxi q, xi P X with
hash
probability O ?1n .
(iv) In a hash table implemented by linear probing using
a table of size n and containing p1 ?
εqn keys, the
worst-case expected query time is Op nq.
Property (2) is used by Baran et al. [3] to solve the 3SUM
problem faster than Θpn2 q, but Property (1) seems to have
more far reaching consequences than Property (2).
Result (i) is tight as no matter which hash function is
used the expected time to insert any element must be Ωp1q.
However, it is not clear whether Results (ii), (iii) and (iv)
Abstract—The most classic textbook hash function, e.g.
taught in CLRS [MIT Press ’09], is
hpxq ppax
bq mod pq mod m ,
pq
where x, a, b P t0, 1, . . . , p 1u and a, b are chosen uniformly
at random. It is known that pq is 2-independent and almost
uniform provided p is a prime and p " m. This implies
that when using pq to build a hash table with chaining that
contains n ¤ m keys, the expected query time
? is O p1q and
the expected length of the longest chain is Op nq. This result
holds for any 2-independent hash function. No hash function
can improve on the expected query time, but the upper bound
on the expected length of the longest chain is not known to
be tight for pq. Partially addressing this problem, Alon et
al. [STOC ’97] proved the existence of a class of linear hash
functions
such that the expected length of the longest chain is
?
Ωp nq and leave as an open problem to decide which nontrivial properties pq has. We make the first progress on this
fundamental problem, by showing that the expected length of
the longest chain is at most n1{3 op1q which means that the
performance of pq is similar to that of a 3-independent
hash
function for which we can prove an upper bound of O n1{3 .
As a lemma we show that within a fixed set of integers there
are few pairs such that the height of the ratio of the pairs are
small. Given two non-zero coprime integers n, m P Z with
n
the height of m
is max t|n| , |m|u, and the height is a way of
measuring how complex a fraction is. This is proved using a
mixture of techniques from additive combinatorics and number
theory, and we believe that the result might be of independent
interest.
For a natural variation of pq, we show that it is possible to
apply second order moment bounds even when a hash value
is fixed. As a consequence:
For min-wise hashing it was known that any key from a
setof nkeys has the smallest hash value with probability
O ?1n . We improve this to n1 op1q .
For linear probing it was known that the worst case
?
expected query time is O p nq. We improve this to nop1q .
Keywords-hashing, linear hashing, hashing with chaining,
additive combinatorics.
I. I NTRODUCTION
Hash functions are widely used and well studied within
theoretical computer science. There are a vast number of
different hash functions in the literature ranging from simple
to complex, and some hash functions are constructed to
perform well when used for specific purposes. In contrast
0272-5428/16 $31.00 © 2016 IEEE
DOI 10.1109/FOCS.2016.45
344
345
hpxq hp2xq then with probability 12 it also holds that
hp2xq hp3xq.
Whenever we are interested in the behaviour of all keys
together the it might not be important whether the collision
probability of a triple is small for every triple or for most
triples of keys. As a direct corollary of Theorem 1 we prove
?
the upper bound in Result (2) can be improved from Op nq
to n1{3 op1q :
are tight. There are no results showing that the bounds in
Results (ii)–(iv) are tight for h. Instead it is shown in [4], [5]
that there exist 2-independent hash functions for which the
upper bounds in Results (ii)–(iv) are tight. This shows that
we either have to use Property (2) or find a new property in
order to improve these bounds. Furthermore, Alon et al. [6]
have shown that there exist linear hash functions meeting the
bounds in Result (ii). These linear hash functions can easily
be altered to be affine hash functions that satisfy Property
(1) as well showing that even a combination of Property
(1) and (2) will not suffice in order to improve on Result
(ii). However in [6] it is also shown that there exists linear
hash function (over the field of 2 elements) for which the
maximum load is Oplog n log log nq.
The discussion above leads to one of two conclusions.
Either Property (1) and (2) captures the essence of h and
we need to prove that Results (ii)–(iv) cannot be improved.
Otherwise Results (ii)–(iv) can be improved and we need to
find a new property of h that must be entirely different of
Property (1) and (2).
Corollary 1. Let U rus be a universe of keys and X € U
a set of n keys such that p ¡ un. Assume that m ¥ n and
let M be the number of keys that
popular
hash to the most (
hash value, i.e. M maxvPrms x P X | hpxq v . Then
E pM q ¤ n1{3 op1q . The same result holds when h is
replaced with h̃ in the definition of M .
In order to improve on Results (iii) and (iv) we prove
the following result, that says that even if we fix one hash
value, we can still use a certain type of second order moment
bounds.
Theorem 2. Let U rus be a universe of keys and X € U
a set of n keys such that p ¡ un. Let v P rms, I € rms be
a set °
of consecutive1 integers mod m and x0 P U zX. Let
A xPX rh̃pxq P I s and μ E pAq. Here the expression
rh̃pxq P I s is the Iverson bracket, which denotes a a number
that is 1 if h̃pxq P I and 0 otherwise. Then for every δ ¡ 0
it holds that:
?
P |A μ| ¥ δ μ | h̃px0 q v
2 n
1
op1q
¤
.
n
O
δ2
pμ
A. Our Results
In the following let p be a prime and m an integer not
larger than p. Let a, b P rps be independent and uniformly
random integers. Let h : rps Ñ rps be defined by hpxq pYax
b]q mod p. We let hpxq hpxq mod m and h̃pxq hpxqm
. We can think of h and h̃ as extracting the least
p
and the most significant bits of h respectively. The main
result of this paper is the proof of the following property of
h and h̃:
Using Theorem 2 we improve on Results (iii) and (iv).
Theorem 1. Let U rus be a universe of keys and X € U
a set of n keys such that p ¡ un and m ¥ n. Let x, y, z P X
be three keys chosen from X independently and uniformly
at random, then:
P hpxq hpy q hpz q O m2 n2 op1q ,
Corollary 2. Let U rus be a universe of keys and X
a set of n keys such that p ¡ un. Let x0 P X. Then:
!
)
1 o p1 q
.
h̃pxq
¤n
P h̃px0 q min
€
U
xPX ztx0 u
Corollary 3. Let U rus be a universe of keys and X € U
a set of n keys such that p ¡ un. Assume that m ¥ p1 εqn
for some constant ε ¡ 0, and say that all of the keys from
X are inserted into a table of size m using h̃ with linear
probing. Then the expected query time for any element u P U
is nop1q .
and the same result holds when h is replaced with h̃.
Comparing Theorem 1 with the performance of a uniformly random hash function we see that if h had been
uniformly random then we could have gotten the bound
Opm2 n2 q instead, which follows from the fact that
for any 3 distinct keys the probability that they have same
hash value is exactly m2 . In fact this holds for any 3independent hash function. Except for the multiplicative
term nop1q the main difference is that for a 3-independent
hash function it holds that for every triple of distinct keys
px, y, z q the probability that they all have the same hash
value is small. Theorem 1 instead states that this property
holds for most such triples. Considering the three keys
px, 2x, 3xq it becomes apparent that it is not possible to
prove that it holds for every triple as if we condition on
B. Technical contribution
The main technical contribution is an upper bound on the
number of pairs within a set where the ratio of the pairs
have small height. For a non-zero rational number x P Q
the height H pxq of x is the smallest number such that
there exists integers n, m P Z with |n| , |m| ¤ H pxq and
n
. Equivalently, if x m
x m
n for co-prime integers n, m
1 Meaning
that for suitable values of r, s P N we have I
1q mod m, . . . , pr s 1q mod mu .
tr mod m, pr
346
345
then H pxq max t|n| , |m|u, see e.g. [7]. We note that
that H pxq H x1 . The main technical contribution is
captured in the following theorem:
A non-zero rational number x P Q can be written as x where m, n P Z are coprime. The height of x is defined
as H pxq max t|m| , |n|u.
For a positive integer n we let d2 pnq be the number of
divisors of n. In general, for a positive integer r we let dr pnq
be the number of ways n can be written as a product of a
sequence of r positive integers, i.e.:
m
n
Theorem 3. Let A € Qz t0u be a set of n non-zero
rational numbers and let k be an integer.
The number
a
that
H
¤
k
is at most
of pairs?pa, a1 q P A2 such
a
O
nk 2
1
plog nqplog log 3kq .
dr pnq |tpa1 , a2 , . . . , ar q P Nr | a1 a2 . . . ar
The theorem is proved using a combination of tools from
additive combinatorics and number theory. The use of this
theorem comes from the fact that we can essentially quantify
“how correlated” hpxq, hpy q, hpz q are by looking at the
x . If the fraction has small height, then the hash
fraction yz
x
values are very correlated and vice versa.2
epxx1 , ξ q epx, ξ qepx1 , ξ q, epx, ξξ 1 q epx, ξ 1 qepx, ξ 1 q,
for all x, x1 , ξ, ξ 1 P G and such that for any x P G other than
the identity there exists ξ P G such that epx, ξ q 1, and for
any ξ P G other than the identity there exists x P G such
that epx, ξ q 1. It is well-known [8] that such a bi-character
exists for any finite abelian group.
For a fixed group G and bi-character e we have the
following definitions from Fourier analysis following the exposition from [8]. The Fourier transform fppξ q of a function
f pxq on G is defined by:
The structure of the paper is as follows. In Section II we
introduce the necessary notation. In Section III we prove
Theorem 3, and in Section IV we show how to use this result
to prove new properties for h and h̃. Finally, in Section V
we show that these properties imply a number of interesting
results.
Z denotes the integers, N the positive integers, Q the
rational numbers, and Q the positive rational numbers.
Zn Z{nZ denotes the integers modn. Zn is the set of
elements of Zn having a multiplicative inverse; pZn , q is
an abelian group. rns is shorthand for t0, 1, 2, . . . , n 1u.
For a pair of integers n, m P Z such that pn, mq p0, 0q
we let gcdpn, mq denote the greatest common divisor of n
and m. If gcdpn, mq 1 then n and m are coprime. We
write a | b to mean that a divides b and a b if a does not
divide b. We use the Iverson bracket notation meaning that
for a condition P we let rP s denote 1 if P is satisfied and
0 otherwise. log denotes the natural logarithm.
Throughout the paper p will be a prime and m a positive
integer smaller than p. We let h : rps Ñ rps be a
random hash function defined by hpxq pax bq mod p
where a, b P rps are chosen independently and uniformly
at random. We will often consider h to be a mapping from
Zp to Zp in the obvious way. We define the hash functions
h, h̃ : rps Ñ rms by:
h̃pxq 1 ¸
f pxqepx, ξ q .
|G|
x PG
fppξ q II. P RELIMINARIES
hpxq hpxq mod p,
nu| .
Let G be a finite abelian group. A bi-character e : G G Ñ Cz t0u is a function that satisfies
C. Structure of the Paper
Z
With this definition we have the Fourier inversion formula:
¸
f pxq P
fppξ qepx, ξ q .
ξ G
For functions f pxq and g pxq on G we define the convolution
pf g qpxq to be:
pf g qpxq We note that p{
f g qpξ q
is:
¸
P
¸
1
f paqg pbq .
|G|
a,bPG,abx
fppξ qg
ppξ q.
2
|f pxq| |G|
x G
Plancherel’s formula
2
¸ fppξ q .
P
ξ G
III. H EIGHTS
The main theorem we prove in this section is:
Theorem 3. Let A € Qz t0u be a set of n non-zero
rational numbers and let k be an integer.
The number
a
of pairs?pa, a1 q P A2 such
that
H
¤
k
is at most
a
^
hpxq m
.
p
O
nk 2
For an integer x P Z we let rxsp P Zp denote its residue
class mod p. We let ι : Zp Ñ rps be the unique mapping
that satisfies rιpxqsp x.
plog nqplog log 3kq .
1
We prove Theorem 3 by counting the number of solutions
to ab a1 b1 where a, a1 P A and b, b1 are small numbers.
An upper bound on this number is given in Lemma 1.
2 This is not entirely true as we need to introduce a concept of height
mod p in order to justify this claim, but for the sake of the simplicity of
the introduction this concept is not defined before needed.
Lemma 1. Let A € N be a set of n positive integers and
B t1, 2, . . . , k u. Let t be the number of solutions to ab 347
346
a1 b1 where a, a1 P A, b, b1 P B. Then for any positive integer
r:
k
¸
1 {r
r
t¤n
1 1 {r
where we use Plancherel’s formula. By Hölder’s inequality
we have:
pdr piqq
2
|
.
¤
S1
p
q
n1
2
pp
A
.
{ k pOplog 3kqqr .
B qp
qq
2
|
x G
|
G|
3
P
|
Ap
q|
2
|
Bp
G|
3
¸ χ{χ
P
ξ G
q|
2
,
p
A
B qpξ q
Bp
q|
(2)
x
¸ χx ξ
{p q
2{pr 1q
|G|
|
A|
2 r 1
|
P
Ap
q|
2
A p
|
|
ξ G
|
r 1
q{pr1q
. (3)
G|
B
B qp
|
|
|
2
Bp
q|
2r
q
Bp
q|
2r
|
B
Bp
q|
2
ξ G
G| xPG
|p
B
B qp
q|
2
Since q ¡ k r for any positive integer n less than q we have
r 1
that |G|
pχB . . . χB qpnq is equal to the number of ways
n can be written as a product of r integers where none is
larger than k. This is equal to 0 for any n ¡ k r and for
any n ¤ k r it is bounded by dr pnq. Combining this with
(4) gives:
2
1
r 1
pχB . . . χB qpxq
| G|
S2 2r 1
|G|
x PG
¸ 1 r
¸ χ χ x
¸P χx ξ χx ξ
¤
n
Letting r loglog
log 3k gives the desired.
Now we turn to proving Lemma 1. The idea is to count the
number of solutions using Fourier analysis. Subsequently,
we give an upper bound on the corresponding sum using
Hlder’s inequality. The result now comes from using the
Fourier inversion formula on the upper bound.
Proof of Lemma 1: Let r be a positive integer. Let q
be a prime greater than maxaPA tau k r , and let G Zq .
We note that A and B can be considered to be subsets of
G and we will do so. Let χA , χB : G Ñ t0, 1u be the
characteristic functions of A and B, respectively. For any
positive integer n less than q we identify n with its residue
mod q, i.e. we consider it to be an element of G. We note
that |G| pχA χB qpnq is the number of pairs pa, bq P A B
such that ab n since ab q for each a P A, b P B, and
hence t is equal to
t | G|
|
P
ξ G
ξ G
By combining this with Lemma 1 we get:
¤
q|
2r
2
2
r, x k r , s 2 in Theorem 4 we get:
r r
Ap
1 r
2r r 1
2
n kr
b 4t
2
will now bound
S . We note that χx ξ
We{
χ . . . χ ξ where there are r terms in the convolution. Hence we have
¸ χx ξ ¸ χ {
S
... χ ξ
P
P
1 ¸
χ
... χ x .
(4)
s
¤
q|
x
plog x z s logp3z s qqz 1 Opzs {plogp3zs qqq
e
.
pz s logp3z s qeγ 1 qzs
where γ 0.577 . . . is Euler’s constant and the constant in
the O-notation depends on ε.
k pO plog 3k qq
Bp
where S1 , S2 are defined in the obvious way. We will bound
S1 and S2 in turn starting with S1 .
Note that by the definition of χA we have |χA pξ q| ¤
|A| for all ξ P G. This implies together with Plancherel’s
|G|
formula that
x
p
|
1
Theorem 4 ([9]). Let ε ¡ 0 be a constant and
° let x ¥ 1, s ¥
1 ε, and let z ¥ 2 be an integer. Then n¤x pdz pnqqs is
bounded by:
dr pnqq2
|
P
pr1q{r S 1{r ,
S
1
p
2
ξ G
1
q|
r 1 r
Before we prove Lemma 1 we will show how Theorem 3
follows from it.
Proof of Theorem 3: First we note that by multiplying
all the numbers of A with a sufficiently large integer we can
wlog assume that A € Zz t0u. Let A1 t|a| | a P Au. We
will use Lemma 1 with A1 and B t1, 2, . . . , k u. We note
that for every pair pa, a1 q P A such that H aa ¤ k there
exist b, b1 P B such that |a| b |a1 | b1 . Furthermore each
such equation can correspond to at most 4 pairs pa, a1 q P A
where H aa ¤ k. So 4t where t is from Lemma 1 is an
upper bound on the number of pairs pa, a1 q P A where the
height of the ratio is ¤ k.
We will need the following theorem proved in [9].
¸
Ap
ξ G
i1
Letting ε 1, z
¸ χx ξ χx ξ
P ¸
p q{ ¸
{
{p
q
χx ξ
χx ξ
¤
|
G|
¸d
kr
1
2r 1
p
r pxqq
2
.
(5)
x 1
Combing (1), (2), (3), and (5) gives
3
t ¤ | G|
|
A|
A p
|
{
q{
r 1 r
¸
|
1 1 r
|
G|
G|
1 r
dr pnqq
p
2
¸d
p
{
1 r
kr
2r 1
{
|
kr
1
r pnqq
2
n 1
,
n 1
as desired.
In the reductions in Section IV we will be interested in the
sum of the reciprocals of the heights studied in Theorem 3.
Corollary 4 gives an upper bound on this quantity.
2
Corollary 4. Let A € Q be a set of n positive rational
numbers. Then:
?
1
O p plog nqplog log 3nqq
¤ n2
.
a
H a
a,a PA
¸
(1)
ξ G
1
348
347
1
Proof: First we
prove (9). Fix integers a, b P Z that sat1
1
1
isfy |a| , |b| ¤ Hp rxsp ry sp
and rxsp ry sp rasp rbsp .
Then p | xb ay. Either |xb ay | p or |xb ay | ¥ p.
If |xb ay | p then since p | xb ay we have
xb ay
1
x
a
¥ H xy ,
and y b . This implies that Hp rxsp ry sp
and so (9) holds. Otherwise |xb ay | ¥ p. By the triangle
1
inequality p ¤ |xb ay | ¤ Hp rxsp ry sp p|x| |y |q and
therefore (9) also holds in this case.
X? \
Now! we turn to proving (10).
p and
) Let z 1
S r0sp x, r1sp x, . . . , rz sp x , then S contains z
distinct elements. Write S as S ts0 , s1 , . . . , sz u where the
elements of S are ordered such
°z that ιps0 q . . . ιpsz q. Let
sz 1 s0 , then we have i0 ιpsi 1 q ιpsi q p. Hence
?
there must exist an index i such that si 1 si ¤ z p 1 p.
By the definition of S we have that si 1 si tx for
?p. So we can write
some t P Zp with |t| ¤ z ?
1
x tpsi 1 si q with |t| , |si 1 si | p and hence
?
Hp pxq p which proves (10).
We will need the following technical lemma. Lemma 3
uses the fact that that the numbers rism x1 for i 0, 1, . . . , ιpxq 1 are essentially evenly spaced in Zm for
any x P Zm . This implies that given an interval I only a
I|
fraction of about |m
of these numbers can be contained in
I.
Proof: For k a positive integer let Vk be the subset of
A2 containing pairs where the height of the ratio is a most
k, i.e.:
)
!
a
Vk pa, a1 q P A | H 1 ¤ k .
a
For convenience let V0 H. Then we know that:
¸
1
a,a1 PA
H
¸ 1
8
a
a1
k 1
¸
8
Firstly we note that:
¸
8
k n 1
¤
|V k | ¸
8
k n 1
n2 1
k
k
| V k zV k 1 |
|V k | k11
1
k
.
(6)
k 1 1 nn 1 n .
(7)
k 1
k11
1
k
2
For any
? k ¤ n we see by Theorem 3 that |Vk |
O p plog nqplog log 3nqq
nk2
. Hence:
ņ
k
ņ
|Vk | k 1 ¤ |Vkk2 |
1
? log n log log 3n k 1
O
¤ n2
p
p
1
k
1
¤
qp
qq
.
Lemma 3. Let m be?a positive integer and et x, y P Zm be
elements satisfying m ¥ |x| ¥ |y | and gcdp|x| , |y |q 1
and let I € Zm be an interval. The number of elements
||x|
r P r|x|s such that rrsm yx1 P I is at most |Im
2.
(8)
Combining (6), (7), and (8) yields the desired conclusion.
Proof: If ιpxq |x| we can replace x and y by x and
y, so wlog assume that ιpxq |x|.
(
1
!QLet A
U rism x ) | i P t0, 1, . . . , ιpxq 1u and B im
| i P r|x|s . First we prove that A B. Since
x
m
IV. R EDUCTION
The goal of this section is to prove Theorem 1.
! For x P Zp , r P N we)let I px, rq denote the set
x, x r1sp , . . . , x rr 1sp . A subset A € Zp is called
an interval if it is on the form I px, rq. A subset B € Zp is
called a generalized interval if B xA for some interval
A and x P Zp . For x P Zp we let |x| be defined as
|
|x| min tιpxq, ιpxqu
For x P Zp we let Hp pxq be the restricted height defined
m
hence ιpiqx1 P B, so we conclude that A € B.
Say that there are k ¥ 2 elements r P r|x|s such that
ryx1 P I. (If k ¤ 1 there is nothing to prove.) If |y | ιpy q
we let I 1 I t0, 1, . . . , |y | 1u and otherwise we let
I 1 I t0, 1, . . . , |y | 1u. We note that |I 1 | |I | |y |. For
each r P r|x|s that satisfies rrsm yx1 P I there is an element
a P A such that a P I 1 . Since I 1 is an interval there exists r P
Zm , s P Z such that I 1 tr, r r1sm , . . . , r rs 1sm u.
Since I 1 contains k elements from B the interval rr, r sq
contains k elements on the form |ip
x| , i P Z, and hence s ¥
pk 1q |xp| . Using that s |I 1 | |I | |y| we get that:
by:
Hp pxq min max t|a| , |b|u | a, b P Zp , x ab1
(
We note that Hp pxq Hp x1 . In Lemma 2 it is proved
?
that Hp pxq ¤ p. The following lemma also shows that if
x, y is much smaller than p, then if the height of xy is large
then the restricted height is large as well.
Lemma 2. Let x, y
Hp
P Z be integers not divisible by p, then:
"
rxsp rysp ¥ min |x| |y| , H
?
Hp pxq p .
1
p
*
x
y
,
|
A and B are finite sets of the same cardinality it suffices
to prove that A € B. Let i P t0, 1, . . . , ιpxq 1u and
j P r|x|s be the unique integer satisfying |x| | mj
Q U ιpiq.
mj ιpiq
mj ιpiq
1
Then
ιpiqx and |x| mj
and
|x|
|x|
(9)
k
(10)
349
348
¤ I |mx|
1
1 |I | |x| |x| |y|
m
m
1¤
|I | |x|
m
2.
By Lemma 3 we have that |Sy X I | is bounded by |Ip|t
so we get
Lemma 4 below show how we can connect the concept of
heights to linear hash functions. Given elements x, y P Zp it
gives an upper bound on the probability that pax, ay q P I 2 ,
when a P Zp is chosen uniformly at random. If the random
variables ax and ay had been independent (they are clearly
2
not!) then the probability would have been exactly p|I | {pq .
1
Lemma 4 shows that when the restricted height of xy is
2
large the probability is close to p|I | {pq .
Lemma 4. Let x, y P Z and I € Z be a generalized
interval. Then:
tP0 ¤
P ppax, ay q P I 2 q ¤ pP pax P I qq
2
O
P pax P I q
Hp pxy 1 q
1
p
,
tP0 ιpxqP0 E
¸
P ph̃py q, h̃pz qq P I 2 | h̃pxq v
¤
sPS
psx, sy q P I
(11)
2
|I |
m
n 1
o p1 q
|I |
m
O p p 1 q .
(12)
If I consists of a single element the statement also holds if
we replace h̃ with h.
Proof: Fix y, z P X z txu. Later in the proof we will
unfix y, z and consider them to be random variables. Let
J h̃1 pI q, then J is an interval. (If I is a singleton, then
1
h pI q is a generalized interval, showing that the statement
also holds if we replace h̃ with h in this case.) Let u P
h̃1 pv q. We will show that (12) holds when we condition on
hpxq u instead of h̃pxq v. Since u is arbitrarily chosen
this will be sufficient. If hpxq u and ph̃py q, h̃pz qq P I 2
then phpy q hpxq, hpz q hpxqq P pJ uq2 , i.e. papy xq, apz xqq P pJ uq2 . Since h is 2-independent hpxq and
hpy q hpxq apy xq are independent, showing that hpxq
and a are independent. So we conclude that:
)
2
Theorem 5. Let U rus be a universe of keys and X € U a
set of n keys such that p ¡ un. Let x P X, v P rms be a hash
value and let I € rms be an interval when considered as
a subset of Zm . Let y, z P X z txu be chosen independently
and uniformly at random, then:
P ph̃py q, h̃pz qq P I 2 | hpxq u
Let S a, r1sp x1 , . . . , rιpxq 1sp x1 . By linearity of
expectation we have:
which finishes the proof as P pax P I q |pI | .
Instead of proving Theorem 1 we will prove a slightly
more general theorem from which Theorem 1 follows.
The idea of the proof is the following (assuming that
I is an interval for simplicity). By multiplying x, y with
an appropriate factor we can wlog assume |y | ¤ ιpxq ¤
?
Hp pxy 1 q p. Instead of directly calculating the probability
that pax, ay q P I 2 we) consider the set S !
a, r1sp x1 , . . . , rιpxq 1sp x1 and upper bound the expected number of elements s P S satisfy psx, sy q P I 2 . By
linearity of expectation this will allow us to upper bound
the desired probability. This turns out to be easier for the
following reason. The set Sx is an interval and hence, for
most values of a, it is either contained in I or disjoint from
I. When it is disjoint from I there are no elements s P S
such that psx, sy q P I 2 . When Sx is contained in I we just
need to calculate |Sy X I |, which we do by using Lemma 3.
The details are given below.
Proof: Firstly, we see that I zJ for some interval J
and z P Zp . Since pax, ay q P I I iff paxz, ayz q P J J
we can replace x, y, I with xz, yz, J respectively and wlog
assume that I is an interval in the following.
For any P Zp the probability that pax, ay q P I I
does not change if we exchange x and y with x and y
respectively. Let Hp pxy 1 q t, then there exists such
that |x| , |y| ¤ t. Since we may also swap x and y we can
?
wlog assume that |y | ¤ ιpxq t p.
We let T be the set of elements s P Zp such that psx, sy q P
I 2 , i.e. T x1 I X y 1 I. Let a P Zp be chosen uniformly
at random and let P0 be the probability that pax, ay q P I 2 ,
!
2 P pSx X I Hq .
|I | t
|I | 2
p
t
p
p
2
|I |
|I | 2
,
3
p
pt
p
P0 ¤
where a P Zp is chosen uniformly at random.
P0 P ppax, ay q P I 2 q .
|I | t
p
Since Sx is an interval consisting of t elements, we have that
the probability that Sx X I H less than |I |p t . Inserting
this gives
p
p
2,
P papy xq, apz xqq P pJ uq2 | hpxq u
P papy xq, apz xqq P pJ uq2 .
(13)
,
By Lemma 4 we get:
P papy xq, apz xqq P pJ uq2
Clearly the number of elements s P S such that psx, sy q P I 2
is bounded by |Sx X I | and |Sy P XI |. Inserting this gives:
¤
tP0 ¤ E pmin t|Sx X I | , |Sy X I |uq .
350
349
|J u|
p
2
O
Hp
1yx |J u|
z x
p
1
.
p
(14)
I|
Since |J u| p ¤ |m
1
(14) to conclude that:
P
p
¤
1
p
h̃py q, h̃pz qq P I 2 | h̃pxq v
I
m
| |
2
1
O
Hp
moment bounds. If h̃ had been 3-independent the upper
bound in (20) could have been replaced with O δ 2 . In
2
n
term,
most cases the nop1q term in (20) will dominate the pμ
and we will be able to prove bounds that are only worse
by a factor of nop1q than the corresponding bounds for 3independent hash functions.
we can combine (13) and
y x
z x
I
m
1
.
p
| |
(15)
Theorem 6. Let U rus be a universe of keys and X € U
a set of n keys such that p ¡ un. Let°v P rms, I € rms be
an interval and x0 P U zX. Let A xPX rh̃pxq P I s. Then
μ E pAq satisfies
Now we unfix y, z and consider y, z P X z txu to be chosen
uniformly at random. By averaging over (15) we get:
P
I
m
| |
p
h̃py q, h̃pz qq P I 2 | h̃pxq v
¸
1
O
pn 1q2
2
y 1 ,z 1 P
X ztxu
Hp
I
| |
y 1 x
z 1 x
¤
1
. (16)
p
m
μ E pA | h̃px0 q v q Θ n and for every δ
P
By Lemma 2 we see that:
Hp
1
y 1 x
z 1 x
!
¤
1
min Ωpnq, H
¤
O
1
n
H
y 1 x
z 1 x
1
y 1 x
z 1 x
)
.
n
O p m 2
n2
o p1 q
q n 2
o p1q
2
Letting σ 2
|
J|
p
op1q
n
O
n2
pμ
,
(19)
.
(20)
2
n1
o p1 q
|
J|
p
O n 2 p 1 .
(21)
V pA | h̃px0 q v q, we get:
σ2
E A2 | h̃px0 q v
μ nop1q
O
n2
pμ
μ2
.
By an application of Chebyshev’s Inequaliy we now get
Equation (20).
Consider linear probing. In [4] it is proved that when
using a 3-independent hash function the worst case expected
query time for any element is O plog nq. This is done by
fixing the hash value of the element we query for, and then
using second order moment bounds. By using Theorem 6
instead of the usual second order moment bound for 3independent hash functions we get an upper bound that is
nop1q Oplog nq nop1q instead. This proves Corollary 3.
Similarly, we can prove that h̃ performs almost as well
as 3-independent hash functions when using it for minwise hashing. This is done in Corollary 2 below. We should
1 o p1 q
compare
to the upper bound of
the upper bound of n
log n
O n
that holds for 3-independent hash functions [5].
Proof: Fix h. Choosing x, y, z P X randomly the
probability
3 that all of the keys are in the same largest bucket
is M
. But the average probability of this over all choices
n
of h is Opm2 n2 op1q q. Hence:
Corollary 1. Let U rus be a universe of keys and X € U
a set of n keys such that p ¡ un. Assume that m ¥ n and
let M be the number of keys that
popular
hash to the most (
hash value, i.e. M maxvPrms x P X | hpxq v . Then
E pM q ¤ n1{3 op1q . The same result holds when h is
replaced with h̃ in the definition of M .
E
?
A μ| ¥ δ μ | h̃px0 q v
E A2 | h̃px0 q v
In this section we apply the results from Section IV to
show performance guarantees when using h and h̃ for hash
tables with chaining, for min-wise hashing and for linear
probing.
We first prove that Corollary 1 is a corollary of Theorem 1.
The idea in the proof is that if there exists a large bucket,
then there must be a high probability that three random keys
have the same hash values.
3 |
0 it holds that:
Proof: (19) follows from the 2-independence of h̃. By
Theorem 5:
(17)
V. A PPLICATIONS
M
n
1
¤
δ2
We now see that Corollary 4 applied to (17) in combination
with (16) gives the desired upper bound.
¡
I
m
| |
. (18)
Corollary 2. Let U rus be a universe of keys and X
a set of n keys such that p ¡ un. Let x0 P X. Then:
Applying Jensen’s inequality to the convex function x Ñ x3
now gives E pM q n1{3 op1q as desired.
The following restatement of Theorem 2 shows that even
if we fix one hash value we can still use second order
P
351
350
h̃px0 q min
xPX ztx0 u
!
h̃pxq
)
¤
n 1
o p1 q
.
€
U
Proof: Let X 1 X z tx0 u. For v P rms let Iv
and let Av be the random variable defined by:
Av
1s
¸ rh̃pxq ¤ vs ¸ rh̃pxq P I s .
! )
[7] E. Bombieri and W. Gubler, Heights in Diophantine geometry.
Cambridge University Press, 2007, vol. 4.
[8] T. Tao, “Lecture notes 2 for 254a.” [Online]. Available:
https://www.math.cmu.edu/ af1p/Teaching/AdditiveCombinatorics/Tao.pdf
v
x PX 1
x PX 1
Now we note that h̃px0 q minxPX
v
rv
h̃pxq iff there exists
1
[9] K. K. Norton, “Upper bounds for sums of powers of divisor
functions,” Journal of Number Theory, vol. 40, no. 1, pp. 60–
85, 1992.
P rms such that h̃px0 q v and Av 0. So we get:
P
! )
h̃px0 q min h̃pxq
¸ P A 0 | h̃px q v P h̃px q v .
x PX 1
v
0
0
v Prms
P h̃px q v O m
aSince
E pA q gives:
! )
P h̃px q min h̃pxq
¸ " n
¤O 1 min 1,
0
1
Theorem 6 with δ
v
0
x PX 1
1
m
n
1
o p1 q
v Prms
O
?1p n
v
1
o p1 q
m
1
o p1 q
O
m2
ppv 1q2
*
.
ACKNOWLEDGMENT
The author would like to thank the anonymous referees
for the most thorough review he has ever received. The
author also thanks Rikke Langhede for introducing him
to the concept of heights, and Mikkel Thorup for helpful
comments.
R EFERENCES
[1] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein,
Introduction to Algorithms (3. ed.). MIT Press, 2009.
[2] L. Carter and M. N. Wegman, “Universal classes of hash
functions,” J. Comput. Syst. Sci., vol. 18, no. 2, pp. 143–154,
1979.
[3] I. Baran, E. D. Demaine, and M. Pǎtraşcu, “Subquadratic
algorithms for 3SUM,” Algorithmica, vol. 50, no. 4, pp. 584–
596, 2008.
[4] M. Pǎtraşcu and M. Thorup, “On the k-independence required
by linear probing and minwise independence,” ACM Trans.
Algorithms, vol. 12, no. 1, p. 8, 2016.
[5] M. B. T. Knudsen and M. Stöckel, “Quicksort, largest bucket,
and min-wise hashing with limited independence,” in Algorithms - ESA 2015 - 23rd Annual European Symposium, Patras,
Greece, September 14-16, 2015, Proceedings, 2015, pp. 828–
839.
[6] N. Alon, M. Dietzfelbinger, P. B. Miltersen, E. Petrank, and
G. Tardos, “Linear hash functions,” Journal of the ACM
(JACM), vol. 46, no. 5, pp. 667–683, 1999.
352
351