(a) The probability that gj( p) = gj( q) for a fixed j. Graphs are
shown for several values of k. In particular, the blue function
(k = 1) is the probability of collision of points p and q under a single random hash function h from the LSH family.
(b) The probability that gj ( p) = gj ( q) for some j = 1…L. The probabilities are shown for two values of k and several values of L.
Note that the slopes are sharper when k is higher.
Fig. 3. The graphs of the probability of collision of points p and q as a function of the distance between p and q for different values
of k and L. The points p and q are d = 100 dimensional binary vectors under the Hamming distance. The LSH family H is the one
described in Section 2. 3.
A practical approach to choosing k was introduced in the E2LSH
package [ 2]. There the data structure optimized the parameter k as a
function of the dataset and a set of sample queries. Specifically, given
the dataset, a query point, and a fixed k, one can estimate precisely the
expected number of collisions and thus the time for distance computations as well as the time to hash the query into all L hash tables. The
sum of the estimates of these two terms is the estimate of the total
query time for this particular query. E2LSH chooses k that minimizes
this sum over a small set of sample queries.
3 LSH Library
To date, several LSH families have been discovered. We briefly survey
them in this section. For each family, we present the procedure of
chosing a random function from the respective LSH family as well as
its locality-sensitive properties.
Hamming distance. For binary vectors from {0, 1} d, Indyk and
Motwani [ 25] propose LSH function hi(p) = pi, where i { 1,…d} is a
randomly chosen index (the sample LSH family from Section 2. 3).
They prove that the exponent is 1/c in this case.
It can be seen that this family applies directly to M-ary vectors (i.e.,
with coordinates in { 1…M}) under the Hamming distance. Moreover,
a simple reduction enables the extension of this family of functions to
M-ary vectors under the l1 distance [ 30]. Consider any point p from
{ 1…M}d. The reduction proceeds by computing a binary string
Unary(p) obtained by replacing each coordinate pi by a sequence of pi
ones followed by M – pi zeros. It is easy to see that for any two M-ary
vectors p and q, the Hamming distance between Unary(p) and
Unary(p) equals the ll1 distance between p and q. Unfortunately, this
reduction is efficient only if M is relatively small.
l d 1 distance. A more direct LSH family for under the l1 distance
is described in [ 4]. Fix a real w R, and impose a randomly shifted
grid with cells of width w; each cell defines a bucket. More specif-
ically, pick random reals s 1, s 2,…sd [0, w) and define hs = 1,…sd
( (x1 – s1)/w ,…, (xd – sd)/w ). The resulting exponent is equal to
= 1/c + O (R/w).
ls distance. For the Euclidean space, [ 17] propose the following LSH
family. Pick a random projection of d onto a 1-dimensional line and
chop the line into segments of length w, shifted by a random value
b [0, w). Formally, hr, b = ( (r·x + b)/ w , where the projection vector
r d is constructed by picking each coordinate of r from the Gaussian
distribution. The exponent drops strictly below 1/c for some (carefully
chosen) finite value of w. This is the family used in the [ 2] package.
A generalization of this approach to ls norms for any s [0, 2) is
possible as well; this is done by picking the vector r from so-called
s-stable distribution. Details can be found in [ 17].
Jaccard. To measure the similarity between two sets A, B U (
containing, e.g., words from two documents), the authors of [ 9, 8] utilize the
Jaccard coefficient. The Jaccard coefficient is defined as s(A, B) = A B A B .
Unlike the Hamming distance, Jaccard coefficient is a similarity measure: higher values of Jaccard coefficient indicate higher similarity of the
sets. One can obtain the corresponding distance measure by taking
d(A, B) = 1 – s(A, B). For this measure, [ 9, 8] propose the following
LSH family, called min-hash. Pick a random permutation on the ground
universe U. Then, define hπ(A) = min{π(a) a A}. It is not hard to
prove that the probability of collision Prπ[hπ(A) = hπ(B)] = s(A, B). See
[ 7] for further theoretical developments related to such hash functions.
Arccos. For vectors p, q d, consider the distance measure that is
the angle between the two vectors, (p, q) = arccos p·q p · q . For this
distance measure, Charikar et al. (inspired by [ 20]) defines the following LSH family [ 14]. Pick a random unit-length vector u d and
define hu(p) = sign(u·p). The hash function can also be viewed as partitioning the space into two half-spaces by a randomly chosen hyperplane.
Here, the probability of collision is Pru[hu(p) = hu(q)] = 1 – (p, q)/π.