system (http://netflixprize.com). As the competition progressed, teams found they obtained the best results by combining their learners with other teams’, and merged into larger and larger teams. The winner and runner-up were both stacked ensembles of over 100 learners, and combining the two ensembles further improved the results. Doubtless we will see even larger ones in the future.
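As a concrete illustration of stacking, here is a minimal sketch in scikit-learn: base learners are fit, and their out-of-fold predictions become the input features of a higher-level combiner. The dataset, models, and parameters are arbitrary assumptions for demonstration, not the Netflix winners' actual systems.

# Minimal stacking sketch (scikit-learn); illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base learners whose out-of-fold predictions become features
# for a higher-level "meta" learner.
base = [("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", SVC(probability=True, random_state=0))]
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression())
stack.fit(X_tr, y_tr)
print("stacked test accuracy:", stack.score(X_te, y_te))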
Model ensembles should not be confused with Bayesian model averaging (BMA), the theoretically optimal approach to learning [4]. In BMA, predictions on new examples are made by averaging the individual predictions of all classifiers in the hypothesis space, weighted by how well the classifiers explain the training data and how much we believe in them a priori. Despite their superficial similarities, ensembles and BMA are very different. Ensembles change the hypothesis space (for example, from single decision trees to linear combinations of them), and can take a wide variety of forms. BMA assigns weights to the hypotheses in the original space according to a fixed formula. BMA weights are extremely different from those produced by (say) bagging or boosting: the latter are fairly even, while the former are extremely skewed, to the point where the single highest-weight classifier usually dominates, making BMA effectively equivalent to just selecting it [8]. A practical consequence of this is that, while model ensembles are a key part of the machine learning toolkit, BMA is seldom worth the trouble.
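The skew in BMA weights is easy to see in a toy setting. In the sketch below, the hypotheses are coin-bias models and the posterior weight of each is proportional to its prior times its likelihood on the data; the grid of biases, the sample size, and the true bias are assumptions purely for illustration.

# Sketch contrasting BMA weights with bagging's uniform weights.
import numpy as np

rng = np.random.default_rng(0)
thetas = np.linspace(0.1, 0.9, 9)       # the hypothesis "space"
data = rng.random(1000) < 0.63          # 1,000 flips from a true bias of 0.63

# BMA: posterior weight of each hypothesis is proportional to
# prior * likelihood of the training data.
k = data.sum()
n = data.size
log_lik = k * np.log(thetas) + (n - k) * np.log(1 - thetas)
log_post = np.log(1.0 / len(thetas)) + log_lik   # uniform prior
w = np.exp(log_post - log_post.max())
w /= w.sum()
print("BMA weights:    ", np.round(w, 3))  # one theta takes nearly all mass

# Bagging, by contrast, averages its models with equal weight:
print("bagging weights:", np.round(np.ones(9) / 9, 3))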
Simplicity Does Not Imply Accuracy
Occam’s razor famously states that entities should not be multiplied beyond necessity. In machine learning, this is often taken to mean that, given two classifiers with the same training error, the simpler of the two will likely have the lower test error. Purported proofs of this claim appear regularly in the literature, but in fact there are many counterexamples to it, and the “no free lunch” theorems imply it cannot be true.
We saw one counterexample previously: model ensembles. The generalization error of a boosted ensemble
continues to improve by adding classifiers even after the training error has reached zero. Another counterexample is support vector machines, which can effectively have an infinite number of parameters without overfitting. Conversely, the function sign(sin(ax)) can discriminate an arbitrarily large, arbitrarily labeled set of points on the x axis, even though it has only one parameter [23]. Thus, contrary to intuition, there is no necessary connection between the number of parameters of a model and its tendency to overfit.
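To see the one-parameter example in action, here is a sketch of the standard construction behind it: for the points x_i = 10^-i, a single choice of a makes sign(sin(ax)) reproduce any desired labeling. The specific points and the formula for a are the textbook construction, reproduced here for illustration.

# sign(sin(a*x)) fits ANY labeling of the points x_i = 10**-i
# with one parameter a, via the standard construction.
import numpy as np

labels = np.array([1, -1, -1, 1, -1, 1])     # any +/-1 labeling
x = 10.0 ** -np.arange(1, len(labels) + 1)   # points 0.1, 0.01, ...

# a = pi * (1 + sum of 10**i over the points labeled -1)
a = np.pi * (1 + sum(10.0 ** i
                     for i, y in enumerate(labels, start=1) if y == -1))

pred = np.sign(np.sin(a * x))
print(pred)                      # matches `labels`
assert np.array_equal(pred, labels)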
A more sophisticated view instead
equates complexity with the size of
the hypothesis space, on the basis that
smaller spaces allow hypotheses to be
represented by shorter codes. Bounds
like the one in the section on theoretical guarantees might then be viewed
as implying that shorter hypotheses
generalize better. This can be further
refined by assigning shorter codes to
the hypotheses in the space we have
some a priori preference for. But
viewing this as “proof” of a trade-off
between accuracy and simplicity is
circular reasoning: we made the hypotheses we prefer simpler by design,
and if they are accurate it is because
our preferences are accurate, not because the hypotheses are “simple” in
the representation we chose.
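In symbols, the bound in question has the familiar Occam form (stated here with assumed notation: m training examples, L(h) the code length of hypothesis h in bits, err and its hatted version the true and training error): with probability at least 1 - \delta,

% Standard Occam's-razor-style generalization bound
\mathrm{err}(h) \;\le\; \widehat{\mathrm{err}}(h)
  \;+\; \sqrt{\frac{L(h)\,\ln 2 + \ln(1/\delta)}{2m}}
  \quad \text{for all } h.

The circularity is then plain to see: L(h) is small precisely for the hypotheses we chose in advance to encode briefly.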
A further complication arises from
the fact that few learners search their
hypothesis space exhaustively. A
learner with a larger hypothesis space
that tries fewer hypotheses from it
is less likely to overfit than one that
tries more hypotheses from a smaller
space. As Pearl [18] points out, the size of the hypothesis space is only a rough guide to what really matters for relating training and test error: the procedure by which a hypothesis is chosen.
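A toy simulation makes the point: if hypotheses are pure random guesses with no true signal, the training accuracy of the best hypothesis found climbs steadily with the number of hypotheses tried, whatever the size of the space they came from. The setup below is an assumption purely for illustration.

# Sketch of Pearl's point [18]: what drives overfitting is how many
# hypotheses the learner effectively tries, not how large the space is.
import numpy as np

rng = np.random.default_rng(0)
n_points = 20
y = rng.integers(0, 2, n_points)   # true labels are coin flips

def best_training_accuracy(n_tried):
    """Training accuracy of the best hypothesis among n_tried random ones."""
    guesses = rng.integers(0, 2, (n_tried, n_points))
    return (guesses == y).mean(axis=1).max()

for k in (1, 10, 100, 10_000):
    print(f"tried {k:>6} hypotheses -> best training accuracy "
          f"{best_training_accuracy(k):.2f}")
# Training accuracy climbs with the number of hypotheses tried,
# even though every hypothesis' true accuracy is 0.5.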
Domingos [7] surveys the main arguments and evidence on the issue of
Occam’s razor in machine learning.
The conclusion is that simpler hypotheses should be preferred because
simplicity is a virtue in its own right,
not because of a hypothetical connection with accuracy. This is probably
what Occam meant in the first place.
Representable Does Not Imply Learnable
Essentially all representations used in
variable-size learners have associated