Technical Perspective
Strange Effects in High Dimension
By Sanjoy Dasgupta
In studying the genetic basis of a disease, it is now common to select a set
of relevant genes G, and to measure
how strongly they are expressed in cell
samples from a group of patients, some
healthy and some ill.[1] The expression
level of a gene is mapped to a value in
[−1, 1]. Each patient’s data is then a vector
with one entry per gene, or equivalently,
a point in R^|G|. The size of G is frequently
in the thousands, which makes the data
high-dimensional by present standards.
Analyzing data in a high-dimensional space R^d is rife with challenges.
First, many statistical estimators are
accurate only when the number of data
points (call it n) is orders of magnitude
larger than d. Some models, such as
histograms, have error rates of the form
n^(−1/d), so that to halve the error, you need 2^d times as much data: bad news even
for double-digit d. A second difficulty is
computational, as typified by the ever-important 2-means clustering problem:
divide a data set into two groups, each
with a designated center, to minimize
the average squared distance from data
points to their closest centers. Naive
approaches run in time n^d, which is astronomical even for smallish d, and NP-hardness results have dampened hopes
for dramatically better algorithms. As
a result, such problems are attacked
with heuristics that offer no guarantees
on their solutions. Understanding the
quality of these schemes is doubly tricky
because they are often justified on intuitive grounds, inevitably informed by
experiences of 2D and 3D space—while
high-dimensional spaces are full of
strange and counterintuitive effects.
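To make this concrete, here is a minimal sketch of the best-known such heuristic, Lloyd's algorithm, specialized to 2-means. The code and its parameter choices are my own illustration, not taken from the article; note that it converges only to a local optimum, which is exactly the sense in which these schemes offer no guarantees.

```python
import numpy as np

def two_means(X, iters=100, seed=0):
    """Lloyd's heuristic with k = 2: alternately assign each point to
    its nearest center, then move each center to the mean of its
    cluster. Converges to a local optimum, with no quality guarantee."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)]  # random init
    for _ in range(iters):
        # Assignment step: label each point 0 or 1 by its nearer center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recenter each cluster (keep old center if empty).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in (0, 1)])
        if np.allclose(new, centers):
            break  # reached a fixed point
        centers = new
    # The objective from the problem statement: average squared distance
    # from each point to its closest center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    cost = (dists.min(axis=1) ** 2).mean()
    return centers, labels, cost
```

In practice one reruns this from many random seeds and keeps the lowest-cost solution, but no amount of restarting changes the worst-case picture painted by the NP-hardness results.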
Such complications are collectively
referred to as the curse of dimension: A
superstition that the murky realm of
high dimension brings bad luck. But
mathematical analysis is starting to
clear away the haze. A pleasant surprise
is that the counterintuitive geometry of
high-dimensional space, once properly
characterized, can be exploited to defeat
other facets of the curse.
Even a solid ball—the simplest of objects—has unusual aspects in high dimension. For d = 2 or d = 3, a set of points
picked at random from the unit ball B^d = {x ∈ R^d : ||x|| ≤ 1} will have some significant fraction near the origin, say within
distance ½ of it. But we wouldn’t find
even one such point in high dimension
unless we were to draw exponentially
many points from the ball. This is because the points within distance r < 1 of the origin constitute an r^d fraction of B^d, and
this fraction goes rapidly to zero with
rising dimension. For large d, the supposedly filled-in ball B^d is in fact well approximated by a thin, hollow shell, {x ∈ R^d : 1 − ε ≤ ||x|| ≤ 1} for ε = O(1/d).
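The shell effect is easy to verify by simulation. The sketch below is my own illustration (sample sizes and dimensions chosen arbitrarily): it draws points uniformly from B^d and counts how many land within distance ½ of the origin.

```python
import numpy as np

def sample_ball(n, d, rng):
    """Uniform samples from the unit ball B^d: a random direction on the
    sphere times a radius drawn as U^(1/d), matching the r^d volume law."""
    g = rng.standard_normal((n, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.random(n) ** (1.0 / d)
    return directions * radii[:, None]

rng = np.random.default_rng(0)
for d in (2, 3, 20, 100):
    X = sample_ball(100_000, d, rng)
    frac = np.mean(np.linalg.norm(X, axis=1) <= 0.5)
    print(f"d = {d:3d}: fraction within radius 1/2 = {frac:.2e}"
          f" (theory: {0.5 ** d:.2e})")
```

For d = 2 the inner fraction is about a quarter; by d = 20 it is around one in a million, and at d = 100 no sample of any feasible size will contain such a point.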
Here’s another curiosity. Suppose we
need lots of vectors in R^d that are orthogonal (at right angles) to each other. How
many can we get? Exactly d, because this
is the maximum number of linearly independent vectors. But if we only need
them approximately orthogonal—with
angles that need not be exactly 90 degrees, but 90 ± ε—then we can find exponentially many vectors. A collection of exp(O(ε²d)) vectors picked at random from the surface of B^d will with high
probability satisfy the angle constraint.
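This claim, too, can be checked numerically. The following sketch is my own illustration (the dimension and vector count are arbitrary choices): it draws random unit vectors and measures the largest pairwise deviation from a right angle.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 1000, 500                    # dimension and number of vectors

# Random points on the surface of B^d: normalized Gaussian vectors.
G = rng.standard_normal((m, d))
U = G / np.linalg.norm(G, axis=1, keepdims=True)

# Entry (i, j) is the cosine of the angle between vectors i and j;
# near-orthogonality means the off-diagonal entries are near zero.
cosines = U @ U.T
off_diag = cosines[~np.eye(m, dtype=bool)]
print("largest |cos(angle)| over all pairs:", np.abs(off_diag).max())
```

In a typical run the largest cosine is on the order of 0.1, so all 500 vectors sit within about ten degrees of mutual orthogonality in a space that can hold only 1,000 exactly orthogonal ones; the same experiment in d = 3 fails immediately.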
These examples hint at the strangeness of high-dimensional space. However, such effects do not directly help
with data analysis because they pertain
to very specialized sets of points—those
chosen randomly from the unit ball—
whereas real data sets might look different. The trick is to take an arbitrary data
set and then add randomness to it in
such a way that the outcome is helpful
and predictable.
An early groundbreaking result of
this kind was Dvoretzky’s theorem.[2] Let K
be a convex body in R^d: for instance, the
feasible region of a linear program with
d variables. K could be enormously complicated. But a random slice through K
(of appropriate dimension) will with high
probability be almost ellipsoidal. More
precisely, it will contain an ellipsoid E
and be contained within (1 + o(1))E.
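For readers who want the statement pinned down, here is one standard modern formulation, paraphrased in my own wording rather than quoted from [2], and stated for symmetric bodies (the general case follows after suitably translating K):

```latex
% Dvoretzky's theorem, in a common modern form (a paraphrase of the
% result cited as [2], not a quotation).
\textbf{Theorem (Dvoretzky).}
\emph{For every $\varepsilon > 0$ there is a constant $c(\varepsilon) > 0$
with the following property: every symmetric convex body
$K \subseteq \mathbb{R}^d$ admits a subspace $V$ of dimension
$k \ge c(\varepsilon) \log d$ whose section is nearly ellipsoidal, i.e.,
there is an ellipsoid $E \subseteq V$ with}
\[
  E \;\subseteq\; K \cap V \;\subseteq\; (1 + \varepsilon)\, E .
\]
\emph{Moreover, a uniformly random $k$-dimensional subspace has this
property with high probability.}
```

The logarithmic dimension of the slice is tight in general: the cube in R^d, for example, has no almost-round sections of dimension much larger than log d.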
A more recent result is the
Johnson-Lindenstrauss theorem.[3] Take any n
points in Euclidean space of arbitrarily
high dimension. If they are projected
into a random subspace of O(log n) dimensions, distances between points
will with high probability be almost perfectly preserved. Since clustering and
other forms of data analysis frequently
depend only on interpoint distances,
the dimension of data sets can automatically be reduced to O(log n).
The following paper by Nir Ailon and
Bernard Chazelle, entitled “Faster Dimension Reduction,” demonstrates an
ingenious variant of this theorem that
permits the projection to be achieved
especially fast.
References
1. Alizadeh, A. et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403 (2000), 503–511.
2. Dvoretzky, A. Some results on convex bodies and Banach spaces. In Proceedings of the International Symposium on Linear Spaces (Jerusalem, 1961), 123–160.
3. Johnson, W. and Lindenstrauss, J. Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics 26 (1984), 189–206.
Sanjoy Dasgupta is an associate professor in the Department of Computer Science and Engineering at the University of California, San Diego.