Currently, the new family is mostly of theoretical interest. This is
because the asymptotic improvement in the running time achieved via
a better separation of collision probabilities makes a difference only for
a relatively large number of input points. Nevertheless, it is quite likely
that one can design better pseudorandom point configurations which do
not suffer from this problem. Some evidence for this conjecture is presented in [3], where it is shown that point configurations induced by the so-called Leech lattice compare favorably with truly random configurations.
2. Preliminaries
2.1 Geometric Normed Spaces
We start by introducing the basic notation used in this article. First,
we use P to denote the set of data points and assume that P has cardinality n. The points p from P belong to a d-dimensional space ℝ^d. We
use p_i to denote the ith coordinate of p, for i = 1, …, d.
For any two points p and q, the distance between them is defined as
||p − q||_s = (∑_{i=1}^{d} |p_i − q_i|^s)^{1/s}
for a parameter s > 0; this distance function is often called the ℓ_s norm.
The typical cases include s = 2 (the Euclidean distance) or s = 1 (the
Manhattan distance).² To simplify notation, we often skip the subscript
2 when we refer to the Euclidean norm, that is, ||p − q|| = ||p − q||_2.
Occasionally, we also use the Hamming distance, which is defined
as the number of positions on which the points p and q differ.

²The name is motivated by the fact that ||p − q||_1 = ∑_{i=1}^{d} |p_i − q_i| is the length of the shortest path between p and q if one is allowed to move along only one coordinate at a time.
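For concreteness, the following short Python sketch (our own illustration, not part of the original article; the function names are chosen for readability) computes the ℓ_s distance and the Hamming distance between two points given as sequences of coordinates.

# Illustrative sketch: l_s and Hamming distances between two points
# represented as equal-length sequences of coordinates.

def dist_ls(p, q, s=2.0):
    # ||p - q||_s = (sum_i |p_i - q_i|^s)^(1/s); s = 2 is Euclidean, s = 1 is Manhattan.
    return sum(abs(pi - qi) ** s for pi, qi in zip(p, q)) ** (1.0 / s)

def dist_hamming(p, q):
    # Number of coordinates on which p and q differ.
    return sum(1 for pi, qi in zip(p, q) if pi != qi)

# Example: for p = (0, 0) and q = (3, 4), dist_ls(p, q, 2) = 5.0 and dist_ls(p, q, 1) = 7.0.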
2.2 Problem Definition
The nearest neighbor problem is an example of an optimization problem:
the goal is to find a point which minimizes a certain objective function
(in this case, the distance to the query point). In contrast, the algorithms
that are presented in this article solve the decision version of the problem. To simplify the notation, we say that a point p is an R-near neighbor
of a point q if the distance between p and q is at most R (see Figure 1).
In this language, for a given parameter R, our algorithm either returns one of the R-near neighbors of q or concludes that no such point exists.
Fig. 1. An illustration of an R-near neighbor query. The nearest
neighbor of the query point q is the point p1. However, both p1
and p2 are R-near neighbors of q.
Naturally, the nearest and near neighbor problems are related. It is
easy to see that the nearest neighbor problem also solves the R-near
neighbor problem: one can simply check if the returned point is an
R-near neighbor of the query point. The reduction in the other direction is somewhat more complicated and involves creating several
instances of the near neighbor problem for different values of R. During
the query time, the data structures are queried in the increasing order
of R. The process is stopped when a data structure reports an answer.
See [22] for a reduction of this type with theoretical guarantees.
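As an illustration of this reduction, here is a minimal Python sketch (our own, under assumed names; the geometric sequence of radii and the query method are illustrative assumptions, not the construction analyzed in [22]) that queries a collection of near neighbor data structures in increasing order of R and stops at the first one that reports an answer.

# Illustrative sketch of the nearest -> near neighbor reduction.
# near_structures is a list of (R, structure) pairs, e.g. with radii
# R_min, c*R_min, c^2*R_min, ...; each structure answers R-near neighbor
# queries via a hypothetical query(q) method returning a point or None.

def approximate_nearest(near_structures, q):
    for R, structure in sorted(near_structures, key=lambda pair: pair[0]):
        answer = structure.query(q)   # some R-near neighbor of q, or None
        if answer is not None:
            return answer             # first (smallest-R) structure that succeeds
    return None                       # no data structure reported a neighbor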
In the rest of this article, we focus on the approximate near neighbor problem. The formal definition of the approximate version of the
near neighbor problem is as follows.
Definition 2.1 (Randomized c-approximate R-near neighbor, or
(c, R)-NN). Given a set P of points in a d-dimensional space ℝ^d, and
parameters R > 0, δ > 0, construct a data structure such that, given
any query point q, if there exists an R-near neighbor of q in P, it reports
some cR-near neighbor of q in P with probability 1 − δ.
For simplicity, we often skip the word randomized in the discussion. In these situations, we will assume that δ is an absolute constant
bounded away from 1 (e.g., 1/2). Note that the probability of success
can be amplified by building and querying several instances of the data
structure. For example, constructing two independent data structures,
each with δ = 1/2, yields a data structure with a probability of failure
δ = 1/2 · 1/2 = 1/4.
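To make the amplification concrete, the following minimal Python sketch (our own illustration; the query method is the same hypothetical interface as above) queries k independently built copies of the data structure. If each copy fails with probability δ, the probability that all of them fail is δ^k.

# Illustrative sketch: amplifying the success probability by querying
# k independently constructed data structures, each failing with probability delta.
# The probability that every copy fails is delta ** k.

def query_amplified(structures, q):
    for structure in structures:      # k independent copies
        answer = structure.query(q)   # hypothetical query method, as above
        if answer is not None:
            return answer
    return None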
In addition, observe that we can typically assume that R = 1.
Otherwise we can simply divide all coordinates by R. Therefore, we
will often skip the parameter R as well and refer to the c-approximate
near neighbor problem or c-NN.
We also define a related reporting problem.
Definition 2.2 (Randomized R-near neighbor reporting). Given a set
P of points in a d-dimensional space ℝ^d, and parameters R > 0, δ > 0,
construct a data structure that, given any query point q, reports each
R-near neighbor of q in P with probability 1 − δ.
Note that the latter definition does not involve an approximation
factor. Also, unlike the case of the approximate near neighbor, here the
data structure can return many (or even all) points if a large fraction of
the data points are located close to the query point. As a result, one
cannot give an a priori bound on the running time of the algorithm.
However, as we point out later, the two problems are intimately
related. In particular, the algorithms in this article can easily be modified to solve both the c-NN and the reporting problems.
2.3 Locality-Sensitive Hashing
The LSH algorithm relies on the existence of locality-sensitive hash
functions. Let H be a family of hash functions mapping ℝ^d to some
universe U. For any two points p and q, consider a process in which
we choose a function h from H uniformly at random, and analyze the
probability that h(p) = h(q). The family H is called locality sensitive
(with proper parameters) if it satisfies the following condition.
Definition 2.3 (Locality-sensitive hashing). A family H is called
(R, cR, P1, P2)-sensitive if for any two points p, q ∈ ℝ^d:
• if ||p − q|| ≤ R then Pr_H [h(q) = h(p)] ≥ P1,
• if ||p − q|| ≥ cR then Pr_H [h(q) = h(p)] ≤ P2.
In order for a locality-sensitive hash (LSH) family to be useful, it has
to satisfy P1 > P2.
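As a concrete example, consider the classical bit-sampling family for the Hamming distance over binary vectors: a hash function simply returns one randomly chosen coordinate of its argument. For two points at Hamming distance r, the collision probability is exactly 1 − r/d, so the family is (R, cR, 1 − R/d, 1 − cR/d)-sensitive, and P1 > P2 whenever c > 1. The Python sketch below is our own illustration of this standard family, not a description of the specific families developed later in the article.

# Illustrative sketch: the bit-sampling LSH family for the Hamming distance.
# A hash function is specified by a random coordinate index i and maps a
# binary vector p to its ith bit, so Pr[h(p) = h(q)] = 1 - Ham(p, q)/d.

import random

def sample_hash_function(d):
    i = random.randrange(d)           # choose the coordinate uniformly at random
    return lambda p: p[i]             # h(p) = p_i

# Example: for d = 4, p = (0, 1, 1, 0) and q = (0, 1, 0, 0) differ in one
# coordinate, so a random h from the family collides on p and q with probability 3/4.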