Fast and Powerful Hashing
By Mikkel Thorup
Randomized algorithms are often enjoyed for their simplicity, but the hash functions employed to yield the desired
probabilistic guarantees are often too complicated to be
practical. Here, we survey recent results on how simple
hashing schemes based on tabulation provide unexpectedly
strong guarantees.
Simple tabulation hashing dates back to Zobrist (A new
hashing method with application for game playing. Technical
Report 88, Computer Sciences Department, University of
Wisconsin). Keys are viewed as consisting of c characters and
we have precomputed character tables h1, ..., hc mapping
characters to random hash values. A key x = (x1, . . ., xc) is hashed
to h1[x1] ⊕ h2[x2] ⊕ · · · ⊕ hc[xc]. This scheme is very fast with
character tables in cache. Although simple tabulation is
not even four-independent, it does provide many of the guarantees that are normally obtained via higher independence,
for example, linear probing and Cuckoo hashing.
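As a concrete illustration, the following is a minimal Python sketch of simple tabulation, assuming 32-bit keys split into c = 4 characters of 8 bits each; the constants and names are illustrative choices, not fixed by the scheme.

```python
import random

# Simple tabulation (Zobrist): one precomputed table of random
# values per character position; the hash is the XOR of one
# lookup per character. Assumes 32-bit keys, four 8-bit characters.
C, BITS = 4, 8
MASK = (1 << BITS) - 1

# Character tables h1, ..., hc filled with random 32-bit values.
tables = [[random.getrandbits(32) for _ in range(1 << BITS)]
          for _ in range(C)]

def simple_tab(x: int) -> int:
    """Hash x by XORing one table lookup per 8-bit character."""
    h = 0
    for i in range(C):
        h ^= tables[i][(x >> (BITS * i)) & MASK]
    return h
```

With the tables resident in cache, each hash costs c lookups and c − 1 XORs, which is what makes the scheme so fast in practice.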
Next, we consider twisted tabulation where one input
character is “twisted” in a simple way. The resulting hash
function has powerful distributional properties: Chernoff-style tail bounds and a very small bias for minwise hashing.
This also yields an extremely fast pseudorandom number
generator that is provably good for many classic randomized algorithms and data structures.
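One way the twist can be realized is sketched below, assuming each of the first c − 1 tables stores a (hash, twister) pair and the accumulated twister is XORed into the final character before its lookup; the table layout and constants here are illustrative, and the cited papers should be consulted for the exact construction.

```python
import random

# A hedged sketch of twisted tabulation: like simple tabulation,
# but the last input character is "twisted" by an XOR derived from
# the other characters before its table lookup. Assumes 32-bit
# keys with four 8-bit characters.
C, BITS = 4, 8
MASK = (1 << BITS) - 1

# Each entry of the first c-1 tables holds (hash value, twister).
pair_tables = [[(random.getrandbits(32), random.getrandbits(BITS))
                for _ in range(1 << BITS)] for _ in range(C - 1)]
last_table = [random.getrandbits(32) for _ in range(1 << BITS)]

def twisted_tab(x: int) -> int:
    h, twist = 0, 0
    for i in range(C - 1):
        hv, tw = pair_tables[i][(x >> (BITS * i)) & MASK]
        h ^= hv
        twist ^= tw
    # The final character is twisted before its lookup.
    last_char = ((x >> (BITS * (C - 1))) & MASK) ^ twist
    return h ^ last_table[last_char]
```

Note that the twist costs essentially nothing on top of simple tabulation: the twisters ride along in the same table lookups.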
Finally, we consider double tabulation where we compose
two simple tabulation functions, applying one to the output
of the other, and show that this yields very high independence
in the classic framework of Wegman and Carter.26 In fact,
w.h.p., for a given set of size proportional to that of the space
consumed, double tabulation gives fully random hashing. We
also mention some more elaborate tabulation schemes getting
near-optimal independence for given time and space.
Although these tabulation schemes are all easy to implement
and use, their analysis is not.
A useful assumption in the design of randomized algorithms and data structures is the free availability of fully random hash functions, which can be computed in unit time.
Removing this unrealistic assumption is the subject of a
large body of work. To implement a hash-based algorithm,
a concrete hash function has to be chosen. The space, time,
and random choices made by this hash function affect the
overall performance. The generic goal is therefore to provide
efficient constructions of hash functions that for important randomized algorithms yield probabilistic guarantees similar to
those obtained assuming fully random hashing.
To fully appreciate the significance of this program, we
note that many randomized algorithms are very simple and
popular in practice, but they are often implemented with
hash functions that are too simple to offer the necessary
guarantees. This may work very well in random tests, adding to
their popularity, but the real world is full of structured data,
for example, generated by computers, that could be bad for
the hash function. This was illustrated in Ref. 21 showing
how simple common inputs made linear probing fail with
popular hash functions, explaining its perceived unreliability
in practice. The problems disappeared when sufficiently
strong hash functions were used.
In this paper, we will survey recent results from Refs. 6–9,
21, 22, 25 showing how simple realistic hashing schemes based
on tabulation provide unexpectedly strong guarantees for
many popular randomized algorithms, for example, linear
probing, Cuckoo hashing, minwise independence, treaps,
planar partitions, power-of-two-choices, Chernoff-style
concentration bounds, and even high independence. The
survey is from a user's perspective, explaining how these
tabulation schemes can be applied. While these schemes
are all very simple to describe and use, the analysis showing that they work is nontrivial. For this analysis, the
reader is referred to the above papers. The reader is also
referred to these papers for a historical account of previous work.
1. Background
Generally, a hash function maps a universe U of keys
into some range R of hash values. A random hash function
h is a random variable from R^U, assigning a random hash
value h(x) ∈ R to every x ∈ U. A truly random hash function
is picked uniformly from R^U, assigning a uniform and independent hash value h(x) ∈ R to each key x ∈ U. Often randomized algorithms are analyzed assuming access to truly
random hash functions. However, just storing a truly random hash function requires |U| log₂ |R| bits, which is unrealistic for large key universes.
The concept of k-independence was introduced by
Wegman and Carter26 in FOCS '79 and has been the cornerstone of our understanding of hash functions ever since.
As above, we think of a hash function h: [u] → [m] as a
random variable distributed over [m]^[u]. We say that h is
k-independent if (a) for any distinct keys x1, . . ., xk ∈ [u], the
hash values h(x1), ..., h(xk) are independent random variables;
and (b) for any fixed x, h(x) is uniformly distributed in [m].
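The textbook family in this framework is a random polynomial of degree below k over a prime field. Here is a minimal sketch, assuming keys smaller than the Mersenne prime p = 2^31 − 1; the final reduction to [m] is the usual "mod m" step, which is only approximately uniform when m does not divide p.

```python
import random

# A k-independent family in the Wegman-Carter framework: evaluate
# a random polynomial of degree < k modulo a prime, then reduce
# to the range [m]. Keys are assumed to lie in [p].
P = (1 << 31) - 1  # Mersenne prime, chosen for convenience

def make_k_independent(k: int, m: int):
    """Return a hash function drawn from a k-independent family."""
    coeffs = [random.randrange(P) for _ in range(k)]
    def h(x: int) -> int:
        v = 0
        for a in reversed(coeffs):  # Horner's rule
            v = (v * x + a) % P
        return v % m
    return h
```

Evaluation takes k multiplications and additions mod p, so higher independence costs proportionally more time; this is the trade-off the tabulation schemes above are designed to beat.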
As the concept of independence is fundamental to probabilistic analysis, k-independent hash functions are both
A previous version of this paper was published in the Proceedings of the
36th IARCS Conference on Foundations of Software Technology and Theoretical Computer Science (2016).