If we shrink the hypothesis space, the
bound improves, but the chances that
it contains the true classifier shrink
also. (There are bounds for the case
where the true classifier is not in the
hypothesis space, but similar considerations apply to them.)
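To make the trade-off concrete, here is a minimal sketch. It assumes the bound in question is the standard one for a learner that returns a hypothesis consistent with all n training examples drawn from a finite hypothesis space H: with probability at least 1 - delta the error is at most epsilon, provided n >= (1/epsilon)(ln|H| + ln(1/delta)). The function name, the choice of d = 10 Boolean features, and the two example hypothesis spaces are illustrative assumptions, not taken from the text.

```python
import math

def examples_needed(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Smallest integer n satisfying n >= (ln|H| + ln(1/delta)) / epsilon."""
    if hypothesis_count < 1 or not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("need |H| >= 1 and epsilon, delta strictly between 0 and 1")
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# Hypothetical hypothesis spaces over d = 10 Boolean features.
d = 10
spaces = {
    # Each feature appears positively, negated, or not at all.
    "conjunctions of literals": 3 ** d,
    # Every possible labeling of the 2^d distinct inputs.
    "all Boolean functions": 2 ** (2 ** d),
}

for name, size in spaces.items():
    n = examples_needed(size, epsilon=0.1, delta=0.05)
    print(f"{name}: at least {n} examples")
```

The smaller space needs far fewer examples for the same guarantee, but the guarantee only says something useful if the true classifier can actually be written as a conjunction.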
Another common type of theoretical guarantee is asymptotic: given infinite data, the learner is guaranteed
to output the correct classifier. This
is reassuring, but it would be rash to
choose one learner over another because of its asymptotic guarantees. In
practice, we are seldom in the asymptotic regime (also known as “asymptopia”). And, because of the bias-variance trade-off I discussed earlier, if
learner A is better than learner B given
infinite data, B is often better than A
given finite data.
The main role of theoretical guarantees in machine learning is not as
a criterion for practical decisions,
but as a source of understanding and
driving force for algorithm design. In
this capacity, they are quite useful; indeed, the close interplay of theory and
practice is one of the main reasons
machine learning has made so much
progress over the years. But caveat
emptor: learning is a complex phenomenon, and just because a learner
has a theoretical justification and
works in practice does not mean the
former is the reason for the latter.
Feature Engineering Is The Key
At the end of the day, some machine
learning projects succeed and some
fail. What makes the difference? Easily the most important factor is the
features used. Learning is easy if you
have many independent features that
each correlate well with the class. On
the other hand, if the class is a very
complex function of the features, you
may not be able to learn it. Often, the
raw data is not in a form that is amenable to learning, but you can construct features from it that are. This
is typically where most of the effort in
a machine learning project goes. It is
often also one of the most interesting
parts, where intuition, creativity and
“black art” are as important as the
technical stuff.
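As a toy illustration of constructing a learnable feature from raw data, the sketch below uses a made-up problem: points on a grid, with the class defined as "inside the unit circle". The learner is a deliberately weak one-feature threshold rule (a decision stump). All names and the dataset are hypothetical and chosen only for illustration.

```python
from itertools import product

def best_stump_accuracy(values, labels):
    """Best training accuracy of a one-feature threshold rule, either polarity."""
    n = len(labels)
    best = 0.0
    for t in sorted(set(values)):
        # Rule: predict True when value <= t; (n - correct) scores its negation.
        correct = sum((v <= t) == y for v, y in zip(values, labels))
        best = max(best, correct / n, (n - correct) / n)
    return best

# Hypothetical raw data: a 17 x 17 grid over [-2, 2] x [-2, 2].
grid = [i / 4 for i in range(-8, 9)]
points = list(product(grid, grid))
labels = [x * x + y * y <= 1.0 for x, y in points]

raw_x = [x for x, _ in points]
raw_y = [y for _, y in points]
radius_sq = [x * x + y * y for x, y in points]  # constructed feature

print("stump on raw x:          ", best_stump_accuracy(raw_x, labels))
print("stump on raw y:          ", best_stump_accuracy(raw_y, labels))
print("stump on x^2 + y^2:      ", best_stump_accuracy(radius_sq, labels))
```

On the raw coordinates the stump can do no better than predicting the majority class, while the constructed squared-radius feature turns the same problem into a single threshold. Knowing that distance from the origin is what matters is exactly the kind of domain knowledge the learner cannot supply on its own.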
A dumb algorithm with lots and lots of data beats a clever one with modest amounts of it.
84 COMMUNICATIONS OF THE ACM | OCTOBER 2012 | VOL. 55 | NO. 10
First-timers are often surprised by
how little time in a machine learning
project is spent actually doing machine
learning. But it makes sense if
you consider how time-consuming it
is to gather data, integrate it, clean it
and preprocess it, and how much trial
and error can go into feature design.
Also, machine learning is not a one-shot process of building a dataset and
running a learner, but rather an iterative process of running the learner,
analyzing the results, modifying the
data and/or the learner, and repeating. Learning is often the quickest
part of this, but that is because we
have already mastered it pretty well!
Feature engineering is more difficult because it is domain-specific,
while learners can be largely general
purpose. However, there is no sharp
frontier between the two, and this is
another reason the most useful learners are those that facilitate incorporating knowledge.
Of course, one of the holy grails
of machine learning is to automate
more and more of the feature engineering process. One way this is often
done today is by automatically generating large numbers of candidate features and selecting the best by (say)
their information gain with respect
to the class. But bear in mind that
features that look irrelevant in isolation may be relevant in combination.
For example, if the class is an XOR of
k input features, each of them by itself carries no information about the
class. (If you want to annoy machine
learners, bring up XOR.) On the other
hand, running a learner with a very
large number of features to find out
which ones are useful in combination
may be too time-consuming, or cause
overfitting. So there is ultimately no
replacement for the smarts you put
into feature engineering.
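The XOR point can be checked directly. The sketch below enumerates the full truth table for k = 3 binary inputs (an arbitrary choice), labels each row with the parity of its inputs, and computes information gain for individual features, for a pair, and for all k together. The helper names are hypothetical, and the estimate is taken over the complete truth table rather than a sample.

```python
import math
from collections import Counter
from itertools import product

def entropy(labels):
    """Shannon entropy, in bits, of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature_indices):
    """H(class) - H(class | selected features), computed from the rows."""
    groups = {}
    for row, label in zip(rows, labels):
        key = tuple(row[i] for i in feature_indices)
        groups.setdefault(key, []).append(label)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

k = 3
rows = list(product([0, 1], repeat=k))
labels = [sum(row) % 2 for row in rows]  # XOR (parity) of the k inputs

for i in range(k):
    print(f"feature {i} alone:", information_gain(rows, labels, [i]))
print("features 0 and 1 together:", information_gain(rows, labels, [0, 1]))
print("all k features together:", information_gain(rows, labels, list(range(k))))
```

Every single feature, and every proper subset of the features, scores zero, so a filter that ranks features one at a time by information gain would discard all of them even though together they determine the class exactly.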
More Data Beats a Cleverer Algorithm
Suppose you have constructed the
best set of features you can, but the
classifiers you receive are still not accurate
enough. What can you do now?
There are two main choices: design a
better learning algorithm, or gather
more data (more examples, and possibly
more raw features, subject to
the curse of dimensionality). Machine
learning researchers are mainly concerned
with the former, but pragmatically
the quickest path to success is