l2 distance on a sphere. Terasawa and Tanaka [ 37] propose an LSH
algorithm specifically designed for points that are on a unit hypersphere in the Euclidean space. The idea is to consider a regular polytope, orthoplex for example, inscribed into the hypersphere and
rotated at random. The hash function then maps a point on the hypersphere into the closest polytope vertex lying on the hypersphere. Thus,
the buckets of the hash function are the Voronoi cells of the polytope
vertices lying on the hypersphere. [ 37] obtain exponent that is an
improvement over [ 17] and the Leech lattice approach of [ 3].
4 Near-Optimal LSH Functions for
Euclidean Distance
In this section we present a new LSH family, yielding an algorithm
with query time exponent ( c) = 1/c2 + O(log log n / log1/3 n). For
large enough n, the value of (c) tends to 1/c2. This significantly
improves upon the earlier running time of [ 17]. In particular, for c = 2,
our exponent tends to 0.25, while the exponent in [ 17] was around
0.45. Moreover, a recent paper [ 31] shows that hashing-based algorithms (as described in Section 2. 3) cannot achieve < 0.462/c2.
Thus, the running time exponent of our algorithm is essentially optimal, up to a constant factor.
We obtain our result by carefully designing a family of locality-sensitive hash functions in l2. The starting point of our construction is the
line partitioning method of [ 17]. There, a point p was mapped into 1
using a random projection. Then, the line 1 was partitioned into
intervals of length w, where w is a parameter. The hash function for p
returned the index of the interval containing the projection of p.
An analysis in [ 17] showed that the query time exponent has an
interesting dependence on the parameter w. If w tends to infinity, the
exponent tends to 1/c, which yields no improvement over [ 25, 19].
However, for small values of w, the exponent lies slightly below 1/c. In
fact, the unique minimum exists for each c.
In this article, we utilize a “multi-dimensional version” of the aforementioned approach. Specifically, we first perform random projection
into t, where t is super-constant, but relatively small (i.e., t = o (log n)).
Then we partition the space t into cells. The hash function function
returns the index of the cell which contains projected point p.
The partitioning of the space t is somewhat more involved than
its one-dimensional counterpart. First, observe that the natural idea of
partitioning using a grid does not work. This is because this process
roughly corresponds to hashing using concatenation of several one-dimensional functions (as in [ 17]). Since the LSH algorithms perform
such concatenation anyway, grid partitioning does not result in any
improvement. Instead, we use the method of “ball partitioning”, introduced in [ 15], in the context of embeddings into tree metrics. The partitioning is obtained as follows. We create a sequence of balls B1, B2…,
each of radius w, with centers chosen independently at random. Each
ball Bi then defines a cell, containing points Bi\ j< i Bj.
In order to apply this method in our context, we need to take care
of a few issues. First, locating a cell containing a given point could
require enumeration of all balls, which would take an unbounded
amount of time. Instead, we show that one can simulate this procedure by replacing each ball by a grid of balls. It is not difficult then to
observe that a finite (albeit exponential in t) number U of such grids
suffices to cover all points in t. An example of such partitioning (for
t = 2 and U = 5) is given in Figure 4.
Fig. 4. An illustration of the the ball partitioning of
the 2-dimensional space.
The second and the main issue is the choice of w. Again, it turns
out that for large w, the method yields only the exponent of 1/c.
Specifically, it was shown in [ 15] that for any two points p, q t , the
probability that the partitioning separates p and q is at most
O t · p – q / w . This formula can be showed to be tight for the range
of w where it makes sense as a lower bound, that is, for w =
t · p – q . However, as long as the separation probability depends
linearly on the distance between p and q, the exponent is still equal
to 1/c. Fortunately, a more careful analysis4 shows that, as in the one-dimensional case, the minimum is achieved for finite w. For that value
of w, the exponent tends to 1/c2 as t tends to infinity.
5 Related Work
In this section, we give a brief overview of prior work in the spirit of
the algorithms considered in this article. We give only high-level simplified descriptions of the algorithms to avoid area-specific terminology. Some of the papers considered a closely related problem of finding
all close pairs of points in a dataset. For simplicity, we translate them
into the near neighbor framework since they can be solved by performing essentialy n separate near neighbor queries.
Hamming distance. Several papers investigated multi-index hashing-based algorithms for retrieving similar pairs of vectors with respect to
the Hamming distance. Typically, the hash functions were projecting
the vectors on some subset of the coordinates { 1…d } as in the example from an earlier section. In some papers [ 33, 21], the authors considered the probabilistic model where the data points are chosen
uniformly at random, and the query point is a random point close to one
of the points in the dataset. A different approach [ 26] is to assume that
the dataset is arbitrary, but almost all points are far from the query
point. Finally, the paper [ 12] proposed an algorithm which did not make
any assumption on the input. The analysis of the algorithm was akin to
the analysis sketched at the end of section 2.4: the parameters k and L
were chosen to achieve desired level of sensitivity and accuracy.
Set intersection measure. To measure the similarity between two sets
A and B, the authors of [ 9, 8] considered the Jaccard coefficient s (A, B),
proposing a family of hash functions h(A) such that Pr[h(A) = h(B)] =
s(A, B) (presented in detail in Section 3). Their main motivation was to
4Refer to [ 3] for more details.