JULY 2017 | VOL. 60 | NO. 7 | COMMUNICATIONS OF THE ACM 97
functions w.h.p., and later every query runs in constant
worst-case time with exactly two probes. We note that even
though cuckoo hashing requires two independent hash
functions, these essentially come at the cost of one in simple tabulation: the pair of hash codes can be stored consecutively, in the same cache line, so that we can look up both for
the price of one.
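This pairing trick is easy to illustrate. The sketch below is a hypothetical Python rendering (not code from the paper), with 32-bit keys split into c = 4 characters of 8 bits: each 64-bit table entry packs two independent 32-bit hash codes, so a single lookup per character yields both h0 and h1.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# One table of 256 random 64-bit entries per 8-bit character.
# Each entry packs two independent 32-bit codes into one word.
TABLES = [[random.getrandbits(64) for _ in range(256)] for _ in range(4)]

def tab_hash_pair(key):
    """Simple tabulation over the 4 bytes of `key`, returning the
    pair (h0(key), h1(key)) from a single pass over the tables."""
    h = 0
    for i in range(4):
        h ^= TABLES[i][(key >> (8 * i)) & 0xFF]
    return h >> 32, h & 0xFFFFFFFF
```

Both codes are read from the same table entries, so in a low-level implementation they share the cache lines touched by the lookups.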
In the dynamic case, Theorem 3 implies that we expect
Ω(n^{4/3}) updates between failures requiring a complete rehash
with new hash functions.
2.4. ε-Minwise independence
A hash function h : [u] → [m] is ε-minwise independent,
or minwise with bias ε, if for any set S ⊆ [u] and q ∈ S, Pr[h(q) = min h(S)] = (1 ± ε)/|S|. The classic application of
ε-minwise hashing of Broder et al. 4 is the estimation of Jaccard
set similarity |A ∩ B|/|A ∪ B|. More precisely, ignoring the
probability of collisions in the minimum hash value, we get Pr[min h(A) = min h(B)] = (1 ± ε) |A ∩ B|/|A ∪ B|.
For better concentration on the estimate, we would make
multiple experiments with independent hash functions, yet
this cannot eliminate the bias ε.
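As an illustration of the estimator (not the article's construction: truly random hash values are simulated here with a pseudorandom generator), repeating the min-hash experiment k times and counting agreements estimates the Jaccard similarity:

```python
import random

def minwise_jaccard_estimate(A, B, k=200, seed=0):
    """Estimate |A & B| / |A | B| from k independent min-hash
    experiments; each experiment checks whether A and B share the
    same minimizer under one random hash function."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(k):
        # Simulate a fresh random hash function by assigning each
        # element of the union an independent random priority.
        h = {x: rng.random() for x in A | B}
        if min(A, key=h.get) == min(B, key=h.get):
            agree += 1
    return agree / k
```

With an ε-minwise function, each experiment agrees with probability (1 ± ε) |A ∩ B|/|A ∪ B|, so averaging over k experiments improves the variance but, as noted above, not the bias ε.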
To get minwise bias ε, we generally need a Θ(log 1/ε)-independent hash function.12, 20 However, Pătrașcu and Thorup21 showed that simple tabulation gives a bias that vanishes as the set grows:

Theorem 4 (Pătrașcu and Thorup21). Consider a set S ⊆ Σ^c
of n = |S| keys and q ∈ S. If h : Σ^c → [m], m ≥ n^{1+1/c}, is implemented by simple tabulation, then Pr[h(q) = min h(S)] = (1 ± Õ(1/n^{1/c}))/n.
2.5. The power of two choices
The power of two choices is a standard scheme for placing
balls into bins where each ball hashes to two bins, and is
placed in the lightest loaded one. When placing n balls into n
bins, using the two-choice paradigm with truly random hash
functions, the maximum load of any bin is lg lg n + O(1) w.h.p.2 Dahlgaard et al.8 have proved that simple tabulation gives a maximum load which is lg lg n + O(1) in expectation and O(log log n)
w.h.p. This is the simplest constant time hashing scheme
known to offer such strong two-choice load balancing.
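The two-choice process itself is easy to simulate; in the sketch below, truly random choices stand in for the two hash functions, and the parameters are illustrative:

```python
import random

def two_choice_max_load(n, seed=1):
    """Throw n balls into n bins; each ball picks two uniformly
    random bins and goes into the currently lighter one."""
    rng = random.Random(seed)
    load = [0] * n
    for _ in range(n):
        i, j = rng.randrange(n), rng.randrange(n)
        load[i if load[i] <= load[j] else j] += 1
    return max(load)
```

For n = 100,000 this typically reports a single-digit maximum load, in line with lg lg n + O(1), whereas one random choice per ball gives a maximum load of Θ(log n/log log n).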
2.6. Weakness with small numbers
As described above, simple tabulation has much more power
than suggested by its three-independence. However, there
are also some weaknesses. For example, in the Chernoff-style bounds (5) and (6) from Theorem 1, we have an additive error probability of 1/m^γ when hashing into m bins. Here γ is
an arbitrarily large constant, so this is fine when m is large.
However, this is not good if m is small, for example, m = 2 as
when we toss a coin. A related problem from Theorem 4 is
that the minwise bias Õ(1/n^{1/c}) depends on the size n of the set considered. This is fine if the set is large, but not if the set is small.
unmatched simplicity and efficiency. On modern architectures, access to memory is done in cache lines (of much
more than a word), so inspecting a few consecutive values
typically translates into a single memory access. Even if the
scan straddles a cache line, the behavior will still be better
than a second random memory access on architectures with prefetching.
Linear probing was shown to take expected constant time
for any operation in 1963 by Knuth, 13 in a report which is now
regarded as the birth of algorithm analysis. This analysis,
however, assumed a truly random hash function. Later, Pagh et al.18 showed that just five-independence suffices
for this expected constant operation time. In Ref. 20, five-independence was proved necessary: there is a concrete combination of keys and a four-independent random hash function
where searching certain keys takes Ω(log n) expected time.
In Ref. 21, the result from Ref. 18 is strengthened for denser linear probing tables, showing that if the table size is m = (1 + ε)n, then the expected time per operation is O(1/ε^2), which asymptotically matches the bound of Knuth13 with
truly random hashing. More important for this paper,
Pătrașcu and Thorup21 proved that this performance bound
also holds with simple tabulation hashing.
In fact, for simple tabulation, we get quite strong concentration results for the time per operation, for example, constant
variance for constant ε. By contrast, with five-independent hashing, the variance is only known to be O(log n).18
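For reference, the operations being analyzed look as follows. This minimal sketch uses Python's built-in hash as a stand-in; the analysis above concerns what happens when that stand-in is five-independent or simple tabulation:

```python
class LinearProbingTable:
    """Minimal linear-probing table (insertions and lookups only)."""

    def __init__(self, m):
        self.slots = [None] * m
        self.m = m

    def _probe(self, key):
        # Scan consecutive slots from the hash location until we
        # find the key or an empty slot.
        i = hash(key) % self.m  # stand-in for the hash function
        while self.slots[i] is not None and self.slots[i] != key:
            i = (i + 1) % self.m
        return i

    def insert(self, key):
        self.slots[self._probe(key)] = key

    def contains(self, key):
        return self.slots[self._probe(key)] == key
```

The scan over consecutive slots is exactly the cache-friendly access pattern described above.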
Experiments in Ref. 21 compare simple tabulation with standard two-independent hashing schemes in linear probing. For simple inputs such as consecutive integers, performance was extremely unreliable with two-independent hashing, but with simple tabulation, everything worked as expected from the theoretical analysis.
2.3. Cuckoo hashing
In cuckoo hashing,19 we use two tables of size m ≥ (1 + ε)n and independent hash functions h0 and h1 mapping the
keys to these two tables. Cuckoo hashing succeeds if we
can place every key in one of its two hash locations without
any collision. We can think of this as a bipartite graph with
a set for each table and an edge (h0(x), h1(x)) for each key x.
Cuckoo hashing fails exactly if this graph has a component
with more edges than vertices. With truly random hashing,
this bad event happens with probability Θ(1/n). Pătrașcu and Thorup21 study the random graphs induced by simple tabulation, and obtain a rather unintuitive result: the worst failure probability is inversely proportional to the cube root of the set size.
Theorem 3 (Pătrașcu and Thorup21). Any set of n keys can be placed in two tables of size m = (1 + ε)n by cuckoo hashing and simple tabulation with probability 1 − O(n^{-1/3}). There exist sets on which the failure probability is Ω(n^{-1/3}).
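The placement behind Theorem 3 can be sketched with the standard eviction ("cuckoo") loop. In this sketch, hash functions are simulated with salted uses of Python's built-in hash, not simple tabulation; returning None corresponds to the failure event the theorem bounds:

```python
import random

def cuckoo_build(keys, m, max_kicks=500, seed=0):
    """Place each key at h0(key) in table 0 or h1(key) in table 1,
    evicting occupants as needed. Returns the two tables, or None
    when an eviction cycle forces a full rehash."""
    rng = random.Random(seed)
    salts = (rng.getrandbits(64), rng.getrandbits(64))
    h = lambda side, x: hash((salts[side], x)) % m
    tables = [[None] * m, [None] * m]
    for key in keys:
        side = 0
        for _ in range(max_kicks):
            i = h(side, key)
            key, tables[side][i] = tables[side][i], key  # evict occupant
            if key is None:
                break           # placed without further evictions
            side ^= 1           # evicted key tries its other table
        else:
            return None         # too many evictions: rehash needed
    return tables
```

A None result is the rehash event; Theorem 3 bounds its probability by O(n^{-1/3}) under simple tabulation.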
Thus, cuckoo hashing with simple tabulation is an excellent construction for a static dictionary. The dictionary can
be built (in linear time) after trying O(1) independent hash