To illustrate the concept, consider the following example. Assume
that the data points are binary, that is, each coordinate is either 0 or 1.
In addition, assume that the distance between points p and q is computed according to the Hamming distance. In this case, we can use a
particularly simple family of functions H which contains all projections of the input point on one of the coordinates, that is, H contains
all functions h d i from {0, 1} to {0, 1} such that hi(p) = pi. Choosing
one hash function h uniformly at random from H means that h(p)
returns a random coordinate of p (note, however, that different applications of h return the same coordinate of the argument).
To see that the family H is locality-sensitive with nontrivial parameters, observe that the probability PrH [h(q) = h(p)] is equal to the fraction of coordinates on which p and q agree. Therefore, P1 = 1 – R/d,
while P2= 1 – cR/d. As long as the approximation factor c is greater
than 1, we have P1 > P2.
2. 4 The Algorithm
An LSH family H can be used to design an efficient algorithm for
approximate near neighbor search. However, one typically cannot use
H as is since the gap between the probabilities P1 and P2 could be
quite small. Instead, an amplification process is needed in order to
achieve the desired probabilities of collision. We describe this step
next, and present the complete algorithm in the Figure 2.
Given a family H of hash functions with parameters (R, cR, P1, P2)
as in Definition 2. 3, we amplify the gap between the high probability
P1 and low probability P2 by concatenating several functions. In particular, for parameters k and L (specified later), we choose L functions
gj(q) = (h1, j(q),…,hk, j(q)), where ht, j ( 1 ≤ t ≤ k, 1 ≤ j ≤ L) are chosen
independently and uniformly at random from H. These are the actual
functions that we use to hash the data points.
The data structure is constructed by placing each point p from the
input set into a bucket gj(p), for j = 1,…,L. Since the total number of
buckets may be large, we retain only the nonempty buckets by resorting to (standard) hashing3 of the values gj(p). In this way, the data
structure uses only O (nL) memory cells; note that it suffices that the
buckets store the pointers to data points, not the points themselves.
To process a query q, we scan through the buckets g1(q),…, gL(q), and
retrieve the points stored in them. After retrieving the points, we com-
3See [ 16] for more details on hashing.
pute their distances to the query point, and report any point that is a valid
answer to the query. Two concrete scanning strategies are possible.
1. Interrupt the search after finding the first L points (including
duplicates) for some parameter L .
2. Continue the search until all points from all buckets are
retrieved; no additional parameter is required.
The two strategies lead to different behaviors of the algorithms. In
particular, Strategy 1 solves the (c, R)-near neighbor problem, while
Strategy 2 solves the R-near neighbor reporting problem.
Strategy 1. It is shown in [ 25, 19] that the first strategy, with
L = 3L, yields a solution to the randomized c-approximate R-near
neighbor problem, with parameters R and for some constant failure
probability < 1. To obtain this guarantee, it suffices to set L to (n ),
where = ln 1/P1 ln 1/P [ 19]. Note that this implies that the algorithm runs in 2
time proportional to n which is sublinear in n if P1 > P2. For example,
if we use the hash functions for the binary vectors mentioned earlier,
we obtain = 1/c [ 25, 19]. The exponents for other LSH families are
given in Section 3.
Strategy 2. The second strategy enables us to solve the randomized
R-near neighbor reporting problem. The value of the failure probability
depends on the choice of the parameters k and L. Conversely, for
each , one can provide parameters k and L so that the error probability is smaller than . The query time is also dependent on k and L. It
could be as high as (n) in the worst case, but, for many natural datasets, a proper choice of parameters results in a sublinear query time.
The details of the analysis are as follows. Let p be any R-neighbor
of q, and consider any parameter k. For any function gi, the probability that g k i(p) = gi(q) is at least P1. Therefore, the probability that
g k L i(p) = gi(q) for some i = 1…L is at least 1 – ( 1 – P1) . If we set L =
log k L 1– Pk so that ( 1 – P ) ≤ , then any R-neighbor of q is returned by 1 1
the algorithm with probability at least 1 – .
How should the parameter k be chosen? Intuitively, larger values of
k lead to a larger gap between the probabilities of collision for close
points and far points; the probabilities are Pk k 1 and P2, respectively (see
Figure 3 for an illustration). The benefit of this amplification is that the
hash functions are more selective. At the same time, if k is large then
Pk 1 is small, which means that L must be sufficiently large to ensure
that an R-near neighbor collides with the query point at least once.
Preprocessing:
1. Choose L functions gj, j = 1,…L, by setting gj = (h1, j, h2, j,…hk, j), where h1, j,…hk, j are chosen at random from the LSH family H.
2. Construct L hash tables, where, for each j = 1,…L, the jth hash table contains the dataset points hashed using the function gj.
Query algorithm for a query point q:
1. For each j = 1, 2,…L
i) Retrieve the points from the bucket g th j(q) in the j hash table.
ii) For each of the retrieved point, compute the distance from q to it, and report the point if it is a correct answer (cR-near
neighbor for Strategy 1, and R-near neighbor for Strategy 2).
iii) (optional) Stop as soon as the number of reported points is more than L .