the same. Similar reasoning applies
to the choice of optimization method: beam search has lower bias than
greedy search, but higher variance, because it tries more hypotheses. Thus,
contrary to intuition, a more powerful
learner is not necessarily better than a
less powerful one.
Figure 2 illustrates this.a Even though the true classifier is a set of rules, with up to 1,000 examples naive Bayes is more accurate than a rule learner. This happens despite naive Bayes’s false assumption that the frontier is linear! Situations like this are common in machine learning: strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting.

a Training examples consist of 64 Boolean features and a Boolean class computed from them according to a set of “IF . . . THEN . . .” rules. The curves are the average of 100 runs with different randomly generated sets of rules. Error bars are two standard deviations. See Domingos and Pazzani10 for details.
Cross-validation can help to combat overfitting, for example by using it
to choose the best size of decision tree
to learn. But it is no panacea, since if
we use it to make too many parameter
choices it can itself start to overfit.
17
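For instance, choosing the maximum depth of a decision tree by k-fold cross-validation might look like the following sketch (the synthetic dataset and the candidate depths are assumptions made here for illustration).

# Sketch: pick a decision tree's size (its maximum depth) by 5-fold
# cross-validation. The dataset and the candidate depths are assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

scores = {}
for depth in (1, 2, 4, 8, 16, None):        # None lets the tree grow fully
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores[depth] = cross_val_score(model, X, y, cv=5).mean()

best_depth = max(scores, key=scores.get)
print(scores)
print("chosen max_depth:", best_depth)
# Caveat from the text: tune many such knobs this way and the
# cross-validation estimate itself starts to overfit.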
Besides cross-validation, there
are many methods to combat overfitting. The most popular one is adding
a regularization term to the evaluation
function. This can, for example, penalize classifiers with more structure,
thereby favoring smaller ones with
less room to overfit. Another option
is to perform a statistical significance
test like chi-square before adding new
structure, to decide whether the distribution of the class really is different with and without this structure.
These techniques are particularly useful when data is very scarce. Nevertheless, you should be skeptical of claims
that a particular technique “solves”
the overfitting problem. It is easy to
avoid overfitting (variance) by falling
into the opposite error of underfitting
(bias). Simultaneously avoiding both
requires learning a perfect classifier,
and short of knowing it in advance
there is no single technique that will
always do best (no free lunch).
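To make the significance-test option concrete, here is a minimal sketch that uses a chi-square test to decide whether a candidate split is justified; the synthetic data, the 0.05 threshold, and the framing as a single pre-pruning check (rather than a full tree learner) are assumptions.

# Sketch: a chi-square pre-pruning check. Add structure (a split on
# `feature`) only if the class distribution differs significantly with
# and without it. Data, features, and the 0.05 threshold are assumptions.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 10))   # 500 examples, 10 Boolean features
y = (X[:, 3] & X[:, 7]).astype(int)      # class depends only on features 3 and 7

def split_is_significant(X, y, feature, alpha=0.05):
    # 2x2 contingency table: rows = feature value, columns = class
    table = np.zeros((2, 2))
    for value in (0, 1):
        for label in (0, 1):
            table[value, label] = np.sum((X[:, feature] == value) & (y == label))
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha

for feature in (0, 3):
    print(f"feature {feature}: split justified? {split_is_significant(X, y, feature)}")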
A common misconception about overfitting is that it is caused by noise, like training examples labeled with the wrong class. This can indeed aggravate overfitting, by making the learner draw a capricious frontier to keep those examples on what it thinks is the right side. But severe overfitting can occur even in the absence of noise. For instance, suppose we learn a Boolean classifier that is just the disjunction of the examples labeled “true” in the training set. (In other words, the classifier is a Boolean formula in disjunctive normal form, where each term is the conjunction of the feature values of one specific training example.) This classifier gets all the training examples right and every positive test example wrong, regardless of whether the training data is noisy or not.
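That memorizing classifier is easy to write down; the sketch below does so on random Boolean data (the data itself is an assumption, while the classifier is exactly the disjunction-of-positive-examples construction just described).

# Sketch: the "disjunction of the positive training examples" classifier.
# It memorizes the training set exactly, so unseen positive examples are
# (almost surely) classified as negative. Random Boolean data is assumed.
import numpy as np

rng = np.random.default_rng(0)
n_features = 64
X_train = rng.integers(0, 2, size=(200, n_features))
y_train = rng.integers(0, 2, size=200)          # labels, noisy or not

# One DNF term per positive training example: it fires only on an exact match.
positives = {tuple(x) for x, label in zip(X_train, y_train) if label == 1}

def predict(x):
    return int(tuple(x) in positives)

train_acc = np.mean([predict(x) == label for x, label in zip(X_train, y_train)])
X_test = rng.integers(0, 2, size=(200, n_features))   # fresh examples
test_positive_rate = np.mean([predict(x) for x in X_test])
print(f"training accuracy: {train_acc:.2f}")  # 1.00, barring duplicate conflicting examples
print(f"fraction of test examples predicted positive: {test_positive_rate:.2f}")  # essentially 0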
The problem of multiple testing13 is
closely related to overfitting. Standard
statistical tests assume that only one
hypothesis is being tested, but modern learners can easily test millions
before they are done. As a result what
looks significant may in fact not be.
For example, a mutual fund that beats
the market 10 years in a row looks very
impressive, until you realize that, if
there are 1,000 funds and each has a
50% chance of beating the market on
any given year, it is quite likely that
one will succeed all 10 times just by
luck. This problem can be combated
by correcting the significance tests to
take the number of hypotheses into
account, but this can also lead to underfitting. A better approach is to control the fraction of falsely accepted
non-null hypotheses, known as the
false discovery rate.3
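Both halves of that argument fit in a few lines: the mutual-fund arithmetic below uses the numbers from the text, and a minimal Benjamini-Hochberg procedure (one standard way to control the false discovery rate) is applied to made-up placeholder p-values.

# Sketch: why 10 wins in a row is unimpressive under multiple testing, plus
# a minimal Benjamini-Hochberg control of the false discovery rate.
# The p-values in the second part are made-up placeholders.
import numpy as np

# 1,000 funds, each with a 50% chance of beating the market in a given year:
p_one_fund = 0.5 ** 10                          # one fund beats it 10 years running
p_at_least_one = 1 - (1 - p_one_fund) ** 1000
print(f"P(some fund wins 10 years in a row by luck) = {p_at_least_one:.2f}")  # about 0.62

def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses accepted at false discovery rate q."""
    p = np.asarray(p_values)
    order = np.argsort(p)
    thresholds = q * np.arange(1, len(p) + 1) / len(p)
    below = p[order] <= thresholds
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0])            # largest i with p_(i) <= q*i/m
    return order[:k + 1]

p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print("discoveries at FDR 0.05:", benjamini_hochberg(p_values))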
Intuition Fails in High Dimensions
After overfitting, the biggest problem
in machine learning is the curse of
dimensionality. This expression was
coined by Bellman in 1961 to refer
to the fact that many algorithms that
work fine in low dimensions become
intractable when the input is high-dimensional. But in machine learning it refers to much more. Generalizing correctly becomes exponentially harder as the dimensionality (number of features) of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space. Even with a moderate dimension of 100 and a huge training set of a trillion examples, the latter covers only a fraction of about 10⁻¹⁸ of the input space.
This is what makes machine learning
both necessary and hard.
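That 10⁻¹⁸ figure is a one-line calculation if we assume Boolean features, so that the input space has 2^100 points (the Boolean assumption is ours; with continuous features the coverage is even worse).

# Sketch: fraction of a 100-dimensional Boolean input space covered by a
# trillion examples. Assuming Boolean features is our simplification.
input_space_size = 2 ** 100        # about 1.27e30 possible examples
training_set_size = 10 ** 12       # a trillion examples
print(f"coverage fraction <= {training_set_size / input_space_size:.1e}")  # about 7.9e-19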