NEAR-OPTIMAL HASHING ALGORITHMS FOR
APPROXIMATE NEAREST NEIGHBOR IN HIGH DIMENSIONS
by Alexandr Andoni and Piotr Indyk
In this article, we give an overview of efficient algorithms for the approximate and exact nearest
neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given
a new query object, one can quickly return the dataset object that is most similar to the query. The
problem is of significant interest in a wide variety of areas.
The goal of this article is twofold. In the first part, we survey a family
of nearest neighbor algorithms that are based on the concept of
locality-sensitive hashing. Many of these algorithms have already been successfully
applied in a variety of practical scenarios. In the second part of this article, we describe a recently discovered hashing-based algorithm for the
case where the objects are points in the d-dimensional Euclidean space.
As it turns out, the performance of this algorithm is provably near-optimal in the class of the locality-sensitive hashing algorithms.
The nearest neighbor problem is defined as follows: given a collection
of n points, build a data structure which, given any query point, reports
the data point that is closest to the query. A particularly interesting and
well-studied instance is where the data points live in a d-dimensional
space under some (e.g., Euclidean) distance function. This problem is
of major importance in several areas; some examples are data compression, databases and data mining, information retrieval, image and
video databases, machine learning, pattern recognition, statistics and
data analysis. Typically, the features of each object of interest (document, image, etc.) are represented as a point in ℝ^d, and the distance
metric is used to measure the similarity of objects. The basic problem
then is to perform indexing or similarity searching for query objects.
The number of features (i.e., the dimensionality) ranges anywhere from
tens to millions. For example, one can represent a 1000 × 1000 image
as a vector in a 1,000,000-dimensional space, one dimension per pixel.
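For concreteness, the pixel representation above can be sketched in a few lines of Python. The vectors here are filled with random values purely for illustration; real features would come from the images themselves.

```python
import math
import random

# Illustrative sketch: represent two 1000 x 1000 grayscale images as
# 1,000,000-dimensional vectors (one coordinate per pixel) and compare
# them under the Euclidean distance.
d = 1000 * 1000
img_a = [random.random() for _ in range(d)]
img_b = [random.random() for _ in range(d)]

def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

print(euclidean(img_a, img_b))
```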
There are several efficient algorithms known for the case when the
dimension d is low (e.g., up to 10 or 20). The first such data structure,
called the kd-tree, was introduced in 1975 by Jon Bentley [6], and remains
one of the most popular data structures used for searching in multidimensional spaces. Many other multidimensional data structures are
known; see [35] for an overview. However, despite decades of intensive effort, the current solutions suffer from either space or query time
that is exponential in d. In fact, for large enough d, in theory or in practice, they often provide little improvement over a linear-time algorithm
that compares a query to each point from the database. This phenomenon is often called "the curse of dimensionality."
Alexandr Andoni (email@example.com) is a Ph.D. candidate in computer
science at the Massachusetts Institute of Technology, Cambridge, MA.
Piotr Indyk (firstname.lastname@example.org) is an associate professor in the
Theory of Computation Group, Computer Science and Artificial Intelligence Lab, at the Massachusetts Institute of Technology, Cambridge, MA.
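The brute-force baseline that these data structures aim to beat can be sketched as follows; it takes O(nd) time per query regardless of dimension. The function name and data are illustrative.

```python
import random

def linear_scan(points, query):
    """Brute-force nearest neighbor: compare the query against every
    database point and return the one with the smallest squared
    Euclidean distance. Cost is O(n * d) per query."""
    return min(points, key=lambda p: sum((x - y) ** 2 for x, y in zip(p, query)))

random.seed(0)
points = [[random.random() for _ in range(16)] for _ in range(1000)]
query = [random.random() for _ in range(16)]
print(linear_scan(points, query))
```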
In recent years, several researchers have proposed methods for overcoming the running-time bottleneck by using approximation (e.g., [5,
27, 25, 29, 22, 28, 17, 13, 32, 1]; see also [36, 24]). In this formulation,
the algorithm is allowed to return a point whose distance from the query
is at most c times the distance from the query to its nearest point; c >
1 is called the approximation factor. The appeal of this approach is that,
in many cases, an approximate nearest neighbor is almost as good as the
exact one. In particular, if the distance measure accurately captures the
notion of user quality, then small differences in the distance should not
matter. Moreover, an efficient approximation algorithm can be used to
solve the exact nearest neighbor problem by enumerating all approximate nearest neighbors and choosing the closest point.¹
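The definition of a c-approximate nearest neighbor can be checked directly; the helper names below are illustrative, not from the article.

```python
def l2(p, q):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

def is_c_approximate(points, query, candidate, c, dist=l2):
    """A candidate is a c-approximate nearest neighbor of the query if its
    distance to the query is at most c times the distance from the query
    to the true nearest neighbor."""
    best = min(dist(p, query) for p in points)
    return dist(candidate, query) <= c * best

points = [[0.0, 0.0], [3.0, 0.0], [10.0, 0.0]]
# True nearest neighbor of [1, 0] is [0, 0] at distance 1; the point
# [3, 0] lies at distance 2, so it is a valid 2-approximate answer.
print(is_c_approximate(points, [1.0, 0.0], [3.0, 0.0], c=2.0))  # True
```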
In this article, we focus on one of the most popular algorithms for
performing approximate search in high dimensions based on the concept of locality-sensitive hashing (LSH) [25]. The key idea is to hash
the points using several hash functions to ensure that for each function the probability of collision is much higher for objects that are
close to each other than for those that are far apart. Then, one can
determine near neighbors by hashing the query point and retrieving
elements stored in buckets containing that point.
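The hashing-and-bucketing idea can be illustrated with one of the simplest LSH families, bit sampling for binary vectors under the Hamming distance: each elementary hash returns one randomly chosen coordinate, so nearby points (small Hamming distance) agree on a random coordinate, and thus collide, more often than far points. The names and parameter values below are illustrative; Section 3 surveys LSH families for other distance measures.

```python
import random

random.seed(1)
d, k, L = 32, 8, 10  # dimension, coordinates per composite hash, number of tables

# Each table uses a composite hash g(p) = (p[i1], ..., p[ik]) built from
# k randomly sampled coordinates; L independent tables boost recall.
tables = [([random.randrange(d) for _ in range(k)], {}) for _ in range(L)]

def g(point, coords):
    return tuple(point[i] for i in coords)

def insert(point):
    for coords, buckets in tables:
        buckets.setdefault(g(point, coords), []).append(point)

def query_candidates(q):
    # Collect every point stored in the buckets that q hashes to.
    cands = []
    for coords, buckets in tables:
        cands.extend(buckets.get(g(q, coords), []))
    return cands

p = [0] * d
insert(p)
q = p[:]  # a query identical to p hashes identically in every table
print(p in query_candidates(q))  # True
```

In practice the candidates retrieved from the buckets are then scanned directly, and the closest one is returned.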
The LSH algorithm and its variants have been successfully applied
to computational problems in a variety of areas, including web clustering [23], computational biology [10, 11], computer vision (see
selected articles in [23]), computational drug design [18], and computational linguistics [34]. An implementation of a variant of this method
is available from the authors [2]. For a more theoretically oriented
overview of this and related algorithms, see [24].
The purpose of this article is twofold. In Section 2, we describe the
basic ideas behind the LSH algorithm and its analysis; we also give an
overview of the current library of LSH functions for various distance
measures in Section 3. Then, in Section 4, we describe a recently
developed LSH family for the Euclidean distance, which achieves a
near-optimal separation between the collision probabilities of close
and far points. An interesting feature of this family is that it effectively
reduces the approximate nearest neighbor problem for
worst-case data to the exact nearest neighbor problem over random (or
pseudorandom) point configurations in low-dimensional spaces.
¹See Section 2.4 for more information about exact algorithms.