spection. (It has even been said that if
people could see in high dimensions
machine learning would not be necessary.) But in high dimensions it is difficult to understand what is happening. This in turn makes it difficult to
design a good classifier. Naively, one
might think that gathering more features never hurts, since at worst they
provide no new information about the
class. But in fact their benefits may
be outweighed by the curse of dimensionality.
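To make the tradeoff concrete, here is a minimal sketch (not from the article; the dataset, seed, and noise scale are all illustrative choices) that pads a digits dataset with irrelevant features and measures how k-nearest neighbor accuracy responds:

    # Sketch: adding uninformative features can hurt a k-NN classifier.
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)  # 8x8 digit images, 64 features
    rng = np.random.default_rng(0)
    for extra in (0, 64, 256, 1024):
        # Append `extra` pure-noise columns at the same scale as the pixels.
        noise = rng.normal(scale=X.std(), size=(len(X), extra))
        Xn = np.hstack([X, noise])
        acc = cross_val_score(KNeighborsClassifier(n_neighbors=3), Xn, y, cv=5).mean()
        print(f"{extra:4d} noise features -> mean CV accuracy {acc:.3f}")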
Fortunately, there is an effect that
partly counteracts the curse, which
might be called the “blessing of nonuniformity.” In most applications
examples are not spread uniformly
throughout the instance space, but
are concentrated on or near a lower-dimensional manifold. For example,
k-nearest neighbor works quite well
for handwritten digit recognition
even though images of digits have
one dimension per pixel, because the
space of digit images is much smaller
than the space of all possible images.
Learners can implicitly take advantage of this lower effective dimension,
or algorithms for explicitly reducing
the dimensionality can be used (for
example, Tenenbaum [22]).
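As a concrete illustration (a sketch under assumed defaults, not the article's experiment), explicit reduction can be done with scikit-learn's Isomap, an implementation of Tenenbaum et al.'s method, before classifying:

    # Sketch: explicit dimensionality reduction with Isomap before k-NN.
    from sklearn.datasets import load_digits
    from sklearn.manifold import Isomap
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Map the 64 pixel dimensions down to 10 manifold coordinates.
    iso = Isomap(n_neighbors=5, n_components=10).fit(X_tr)
    knn = KNeighborsClassifier(n_neighbors=3).fit(iso.transform(X_tr), y_tr)
    print("test accuracy:", knn.score(iso.transform(X_te), y_te))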
Theoretical Guarantees Are Not What They Seem
Machine learning papers are full of
theoretical guarantees. The most common type is a bound on the number of
examples needed to ensure good generalization. What should you make of
these guarantees? First of all, it is remarkable that they are even possible.
Induction is traditionally contrasted
with deduction: in deduction you can
guarantee that the conclusions are
correct; in induction all bets are off.
Or such was the conventional wisdom
for many centuries. One of the major
developments of recent decades has
been the realization that in fact we can
have guarantees on the results of induction, particularly if we are willing
to settle for probabilistic guarantees.
The basic argument is remarkably
simple [5]. Let's say a classifier is bad
if its true error rate is greater than ε.
Then the probability that a bad classifier is consistent with n random, independent training examples is less
than (1 − ε)^n. Let b be the number of
bad classifiers in the learner's hypothesis space H. The probability that at
least one of them is consistent is less
than b(1 − ε)^n, by the union bound. Assuming the learner always returns a
consistent classifier, the probability
that this classifier is bad is then less
than |H|(1 − ε)^n, where we have used
the fact that b ≤ |H|. So if we want this
probability to be less than δ, it suffices
to make n > ln(δ/|H|)/ln(1 − ε); and
since ln(1 − ε) ≤ −ε, it is enough to
take n ≥ (1/ε)(ln |H| + ln(1/δ)).
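Translating the bound into code makes the dependence on |H|, ε, and δ visible; this is a hypothetical helper, not from the article:

    # Sketch: examples sufficient for P(a consistent classifier is bad) < delta.
    import math

    def sufficient_examples(ln_H: float, epsilon: float, delta: float) -> float:
        """n >= (1/epsilon) * (ln|H| + ln(1/delta)) suffices."""
        return (ln_H + math.log(1 / delta)) / epsilon

    # E.g., |H| = 2^20 hypotheses, epsilon = delta = 0.05:
    print(sufficient_examples(20 * math.log(2), 0.05, 0.05))  # ~337 examples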
Unfortunately, guarantees of this
type have to be taken with a large grain
of salt. This is because the bounds obtained in this way are usually extremely loose. The wonderful feature of the
bound above is that the required number of examples only grows logarithmically with |H| and 1/δ. Unfortunately, most interesting hypothesis spaces
are doubly exponential in the number
of features d, which still leaves us
needing a number of examples exponential in d. For example, consider
the space of Boolean functions of d
Boolean variables. If there are e possible different examples, there are
2^e possible different functions, so
since there are 2^d possible examples,
the total number of functions is 2^(2^d).
And even for hypothesis spaces that
are “merely” exponential, the bound
is still very loose, because the union
bound is very pessimistic. For example, if there are 100 Boolean features
and the hypothesis space is decision
trees with up to 10 levels, to guarantee
δ = ε = 1% in the bound above we need
half a million examples. But in practice a small fraction of this suffices for
accurate learning.
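The half-million figure can be checked with back-of-envelope arithmetic; the tree count below is a rough estimate I am assuming (a feature choice at each internal node, a class label at each leaf), not a count given in the article:

    # Rough check of the decision-tree example: d = 100 Boolean features,
    # trees of depth <= 10, epsilon = delta = 0.01.
    import math

    internal_nodes = 2**10 - 1  # a full binary tree of depth 10
    leaves = 2**10
    # ln|H| ~ a feature choice per internal node plus a label per leaf.
    ln_H = internal_nodes * math.log(100) + leaves * math.log(2)

    n = (ln_H + math.log(1 / 0.01)) / 0.01
    print(f"{n:,.0f}")  # ~542,000 -- on the order of half a million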
Further, we have to be careful
about what a bound like this means.
For instance, it does not say that, if
your learner returned a hypothesis
consistent with a particular training
set, then this hypothesis probably
generalizes well. What it says is that,
given a large enough training set, with
high probability your learner will either return a hypothesis that generalizes well or be unable to find a consistent hypothesis. The bound also says
nothing about how to select a good
hypothesis space. It only tells us that,
if the hypothesis space contains the
true classifier, then the probability
that the learner outputs a bad classifier decreases with training set size.