Moreover, for any (including ):
If we have a given distinguished query key q ∈ [u], the above bounds hold even if we condition everything on h(q) = a for any given a ∈ [u].a
The statement of Theorem 5 may seem a bit cryptic, so before proceeding, we will show that it improves the simple tabulation bounds from Theorem 1. Those bounds considered a set S of n balls or keys mapped into m bins, and had an error bound of m^{−γ}. The error is here improved to u^{−γ}. This is important when m is small, for example, if m = 2, corresponding to unbiased coin tosses.
We now do the translation. Discounting all irrelevant keys x ∈ [u]\S, we zero their value functions, setting v_x(⋅) = 0. Also, we define the bin hash function from Theorem 1 as h′(x) = h(x) mod m, noting that m divides u since both are powers of two. Theorem 1 studies the number of balls landing in a bin b, which may be a function of the bin h′(q) of a query ball q ∈ [u]\S. Thanks to the last statement of Theorem 5, we can condition on any value a of h(q), which determines h′(q) and hence b. Now, for x ∈ S, define v_x(y) = 1 if y mod m = b and 0 otherwise, so that V_x = v_x(h(x)) = 1 exactly when h′(x) = b. Then V = Σ_{x∈[u]} V_x is the variable X from Theorem 1, and (5) and (6) follow from (9) and (10), but with the improved error u^{−γ}.
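To make this translation concrete, here is a minimal Python sketch (parameter choices and names are our own): keys are hashed with simple tabulation into [u], bins are derived as h′(x) = h(x) mod m, and the value function counts the balls landing in a fixed bin b.

```python
import random

C, CHAR_BITS = 4, 8                       # c = 4 characters of 8 bits: u = 2^32
U = 1 << (C * CHAR_BITS)
random.seed(1)

# One fully random table per character position (simple tabulation).
tables = [[random.randrange(U) for _ in range(1 << CHAR_BITS)]
          for _ in range(C)]

def h(x):
    """Simple tabulation: XOR one table lookup per character of x."""
    v = 0
    for i in range(C):
        v ^= tables[i][(x >> (i * CHAR_BITS)) & 0xFF]
    return v

m = 2                                     # a power of two, so m divides u

def h_bin(x):
    """The derived bin hash h'(x) = h(x) mod m."""
    return h(x) % m

# Value function for a fixed bin b: v_x(y) = 1 iff y mod m == b.
b = 0
S = range(1000)                           # the n stored keys
V = sum(1 for x in S if h_bin(x) == b)    # the variable X from Theorem 1
print(V)                                  # should concentrate around n/m = 500
```

With m = 2 this is exactly the coin-toss setting mentioned above: each key independently looks like a near-unbiased coin, and V concentrates around n/2.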
A different illustration of the versatility of Theorem 5 is if we want each key x ∈ [u] to be sampled with some individual sampling probability p_x. In this case, we have no distinguished query key, and we can just define v_x(y) = 1 if y < p_x ⋅ u and 0 otherwise. Since h(x) is uniform in [u], we have that x is sampled with V_x = v_x(h(x)) = 1 with probability p̃_x = ⌈p_x u⌉/u. The number V = Σ_{x∈[u]} V_x of samples is now concentrated according to (9) and (10).
3.2. Minwise independence
Concerning minwise hashing, Dahlgaard and Thorup9 have proved that twisted tabulation yields the following strengthening of Theorem 4 for simple tabulation.
Theorem 6 (Dahlgaard and Thorup9). Consider a set S ⊆ Σ^c of n = |S| keys and q ∈ S. If h : Σ^c → [m], m ≥ n·u^{1/c}, is implemented with twisted tabulation, then Pr[h(q) = min h(S)] = (1 ± Õ(1/u^{1/c}))/n.
The important difference is that the bias Õ(1/n^{1/c}) from Theorem 4 is replaced by Õ(1/u^{1/c}), which is small regardless of the set size. Such an absolutely small bias generally requires Ω(log u)-independence.20
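To see what the minwise guarantee says operationally, the following toy experiment (using a fully random stand-in hash rather than twisted tabulation, with our own parameter choices) checks that a fixed query key minimizes the hash values of S with frequency close to 1/n:

```python
import random

random.seed(3)
U = 1 << 32

def fresh_hash():
    """A new fully random hash function (stand-in for a fresh twisted hash)."""
    vals = {}
    def h(x):
        if x not in vals:
            vals[x] = random.randrange(U)
        return vals[x]
    return h

S = list(range(100))                      # n = 100 keys
q = 0                                     # the distinguished query key
trials = 2000
hits = 0
for _ in range(trials):
    h = fresh_hash()
    if h(q) == min(h(x) for x in S):
        hits += 1
print(hits / trials)                      # should be close to 1/n = 0.01
```

This "each key wins with probability ≈ 1/n" property is what makes the key of minimum hash value a near-uniform sample from S, the basis of minwise sketching.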
3.3. Short range amortization for hash tables
We now switch to a quite different illustration of the power of twisted tabulation hashing from Ref. 22. Consider linear probing in a half-full hash table with n keys. Out of n operations, we expect some to take Ω(log n) time. Nevertheless, we show that any window of log n operations on distinct keys is executed in O(log n) time with
high probability. This also holds for the simpler case of chaining.
The general point is that for any set of stored keys and any
set of window keys, the operation times within the window
are sufficiently independent that the average concentrates
nicely around the expected constant operation time. Such
concentration of the average should not be taken for granted
with real hash functions. Ref. 20 gives input examples for linear probing with fast two-independent hashing such that if
one operation is slow, then most operations are slow. Ref. 22
presented a parallel universe construction causing similar
problems for simple tabulation. As stated above, twisted
tabulation does, however, provide sufficient independence,
and we expect this to prove useful in other applications.
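For readers unfamiliar with linear probing, here is a minimal sketch of the data structure itself (insert and lookup only); the hash function is a generic seeded stand-in, not twisted tabulation, and all names are our own:

```python
import random

random.seed(5)

class LinearProbingTable:
    """Minimal linear probing hash table: insert and lookup only."""

    def __init__(self, capacity):
        self.cap = capacity               # a power of two, kept half full
        self.slots = [None] * capacity
        self.salt = random.randrange(1 << 32)

    def _h(self, key):
        # Generic stand-in hash into [cap]; NOT twisted tabulation.
        return hash((self.salt, key)) & (self.cap - 1)

    def insert(self, key):
        i = self._h(key)
        while self.slots[i] is not None:  # scan the run for a free slot
            if self.slots[i] == key:
                return                    # already present
            i = (i + 1) & (self.cap - 1)
        self.slots[i] = key

    def contains(self, key):
        i = self._h(key)
        while self.slots[i] is not None:  # a run ends at the first empty slot
            if self.slots[i] == key:
                return True
            i = (i + 1) & (self.cap - 1)
        return False

table = LinearProbingTable(1 << 12)       # 4096 slots
for k in range(2048):                     # fill to half: n = 2048 keys
    table.insert(k)
print(table.contains(7), table.contains(9999))
```

The cost of each operation is the length of the run it scans; the point of the section above is that with twisted tabulation these run lengths are sufficiently independent across a window of operations for their average to concentrate.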
3.4. Pseudorandom number generators
Like any hash function, twisted tabulation naturally implies a PRG with the pseudorandom sequence h(0), h(1), … For maximal efficiency, we use the last and least significant character as the head. Thus, a key x = (x_{c−1}, …, x_0) ∈ Σ^c has head(x) = x_0 and tail(x) = x_{>0} = (x_{c−1}, …, x_1). For twisted tabulation, we use a simple tabulation function h : Σ^{c−1} → Σ × [2^r] on the tail and a character function h_0 : Σ → [2^r], and then h(x) is computed by setting (t, h_{>0}) = h(x_{>0}) and h(x) = h_0[x_0 ⊕ t] ⊕ h_{>0}. We now think of x as the pair (x_{>0}, x_0) ∈ Σ^{c−1} × Σ. As we increase the index x = 0, 1, …, Σ−1, Σ, Σ+1, … = (0, 0), (0, 1), …, (0, Σ−1), (1, 0), (1, 1), …, the tail x_{>0} only increases when x_0 wraps around to 0, that is, once in every Σ calls. We therefore store (t, h_{>0}) = h(x_{>0}) in a register, recomputing it only when x_{>0} increases.
Otherwise, we compute h(x) = h_0[x_0 ⊕ t] ⊕ h_{>0} using just one character lookup and two ⊕-operations. In Ref. 22 this was found to be exceedingly fast: as fast as a single multiplication, and four times faster than the standard random number generator random() from the GNU C library, which has almost no probabilistic guarantees. Besides being faster, the twisted PRG offers the powerful distributional guarantees discussed above.
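The fast generation loop described above can be sketched as follows. This is a schematic Python model (the tail hash is a seeded random stand-in for simple tabulation, and all names are our own) showing the register caching of (t, h_{>0}) and the one-lookup, two-XOR common case:

```python
import random

CHAR_BITS, R = 8, 32
SIGMA = 1 << CHAR_BITS                    # head characters: Sigma = 256
random.seed(9)

# Character table h0 : Sigma -> [2^r].
h0 = [random.randrange(1 << R) for _ in range(SIGMA)]

_tail_cache = {}

def tail_hash(x_tail):
    """Stand-in for the simple tabulation hash of the tail: (t, h_tail)."""
    if x_tail not in _tail_cache:
        _tail_cache[x_tail] = (random.randrange(SIGMA),
                               random.randrange(1 << R))
    return _tail_cache[x_tail]

class TwistedPRG:
    def __init__(self):
        self.x = 0
        self.t, self.h_tail = tail_hash(0)   # the cached "register"

    def next(self):
        x0 = self.x & (SIGMA - 1)            # head character of the index
        out = h0[x0 ^ self.t] ^ self.h_tail  # one lookup, two XORs
        self.x += 1
        if self.x & (SIGMA - 1) == 0:        # head wrapped around to 0:
            self.t, self.h_tail = tail_hash(self.x >> CHAR_BITS)
        return out

g = TwistedPRG()
vals = [g.next() for _ in range(1000)]
print(vals[:3])
```

The tail hash is recomputed only once per Σ = 256 outputs, which is why the common case reduces to a single table lookup and two XORs.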
As an alternative implementation, we note that the simple tabulation function on the tail is itself applied to consecutive numbers x_{>0} = 0, 1, 2, …, so it can also be implemented as a PRG. This tail PRG is only applied once for every Σ numbers generated by h, so it can be much slower without affecting the overall performance. Instead of implementing the tail function by simple tabulation, we could implement it with any logarithmically independent PRG, thus storing no tables for it, but instead generating each new value (t, h_{>0}) on the fly as x_{>0} increases. We can view this as a general conversion of a comparatively slow but powerful PRG into an extremely fast one preserving the many probabilistic properties of twisted tabulation.
3.5. Randomized algorithms and data structures
When using the twisted PRG in randomized algorithms,16 we get the obvious advantage of the Chernoff-style bounds from Theorem 5, which is one of the basic techniques needed.16, Section 4 The ε-minwise hashing from Theorem 6 with ε = Õ(1/u^{1/c}) is important in contexts where we want to
a The last statement conditioning on h(q) = a was not proved in Ref. 22, but is an easy extension using ideas from Ref. 21.