Faster Dimension Reduction
By Nir Ailon and Bernard Chazelle
DOI: 10.1145/1646353.1646379
Abstract
Data represented geometrically in high-dimensional vector spaces can be found in many applications. Images and
videos are often represented by assigning a dimension
to every pixel (and time). Text documents may be represented in a vector space in which each word of the dictionary corresponds to a dimension. The need to manipulate such data
in huge corpora such as the web and to support various
query types gives rise to the question of how to represent
the data in a lower-dimensional space to allow more space-
and time-efficient computation. Linear mappings are an
attractive approach to this problem because the mapped
input can be readily fed into popular algorithms that operate on linear spaces (such as principal-component analysis, PCA) while avoiding the curse of dimensionality.
The fact that such mappings even exist became known
in computer science following seminal work by Johnson
and Lindenstrauss in the early 1980s. The underlying
technique is often called “random projection.” The complexity of the mapping itself, essentially the product of a
vector with a dense matrix, did not attract much attention
until recently. In 2006, we discovered a way to “sparsify”
the matrix via a computational version of Heisenberg’s
Uncertainty Principle. This led to a significant speedup,
which also retained the practical simplicity of the standard Johnson–Lindenstrauss projection. We describe
the improvement in this article, together with some of its
applications.
1. INTRODUCTION
Dimension reduction, as the name suggests, is an algorithmic
technique for reducing the dimensionality of data.
From a programmer’s point of view, a d-dimensional array
of real numbers, after applying this technique, is represented
by a much smaller array. This is useful because
many data-centric applications suffer from exponential
blowup as the underlying dimension grows. The infamous
curse of dimensionality (the exponential dependence of an
algorithm’s running time on the dimension of the input) can be avoided
if the input data is mapped into a space of logarithmic
dimension (or less); for example, an algorithm running
in time proportional to 2^d in dimension d will run in
linear time if the dimension can be brought down to
log d. Common beneficiaries of this approach are clustering
and nearest-neighbor searching algorithms. One
typical case involving both is organizing a massive
corpus of documents in a database that allows one to
respond quickly to similar-document searches.
The clustering is used in the back-end to eliminate (near)
duplicates, while nearest-neighbor queries are processed
at the front-end. Reducing the dimensionality of the
data helps the system respond faster to both queries and
data updates. The idea, of course, is to retain the basic
metric properties of the data set (e.g., pairwise distances)
while reducing its size. Because this is technically
impossible to do, one will typically relax this demand and
tolerate errors as long as they can be made arbitrarily
small.
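To make this concrete, the following is a minimal sketch (in Python with NumPy; not code from the article) of the classical dense Johnson–Lindenstrauss random projection discussed above: n points in d dimensions are multiplied by a random Gaussian matrix with on the order of log n rows (scaled by 1/eps^2), and pairwise distances survive up to a small relative error. The parameter choices (n, d, eps, and the constant 4) are illustrative assumptions, not the authors’ settings.

import numpy as np

# A hypothetical toy setup: n points in d dimensions, target distortion eps.
rng = np.random.default_rng(0)
n, d, eps = 200, 10_000, 0.2
k = int(np.ceil(4 * np.log(n) / eps**2))   # target dimension, O(log n / eps^2)

X = rng.standard_normal((n, d))            # synthetic high-dimensional data

# Classical dense Johnson–Lindenstrauss map: a k-by-d Gaussian matrix,
# scaled so that squared lengths are preserved in expectation.
R = rng.standard_normal((k, d)) / np.sqrt(k)
Y = X @ R.T                                # the mapped, k-dimensional points

# Empirically check how much random pairwise distances are distorted.
worst = 0.0
for _ in range(1000):
    i, j = rng.integers(n, size=2)
    if i == j:
        continue
    orig = np.linalg.norm(X[i] - X[j])
    new = np.linalg.norm(Y[i] - Y[j])
    worst = max(worst, abs(new / orig - 1.0))
print(f"reduced dimension {d} -> {k}; worst observed relative error ~{worst:.3f}")

Note that the mapping here is exactly the dense matrix–vector product mentioned in the abstract; the speedup described later in the article comes from replacing this dense matrix with a structured, sparser one.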