Table 1. The three components of learning algorithms.

Representation
  Instances: K-nearest neighbor, Support vector machines
  Hyperplanes: Naive Bayes, Logistic regression
  Decision trees
  Sets of rules: Propositional rules, Logic programs
  Neural networks
  Graphical models: Bayesian networks, Conditional random fields

Evaluation
  Accuracy/Error rate
  Precision and recall
  Squared error
  Likelihood
  Posterior probability
  Information gain
  K-L divergence
  Cost/Utility
  Margin

Optimization
  Combinatorial optimization: Greedy search, Beam search, Branch-and-bound
  Continuous optimization
    Unconstrained: Gradient descent, Conjugate gradient, Quasi-Newton methods
    Constrained: Linear programming, Quadratic programming
Algorithm 1. Decision tree induction.

LearnDT(TrainSet)
  if all examples in TrainSet have the same class y* then
    return MakeLeaf(y*)
  if no feature xj has InfoGain(xj, y) > 0 then
    y* ← Most frequent class in TrainSet
    return MakeLeaf(y*)
  x* ← argmax_xj InfoGain(xj, y)
  TS0 ← examples in TrainSet with x* = 0
  TS1 ← examples in TrainSet with x* = 1
  return MakeNode(x*, LearnDT(TS0), LearnDT(TS1))
combination of the features per class and predict the class with the highest-valued combination. Decision trees test one feature at each internal node, with one branch for each feature value, and have class predictions at the leaves. Algorithm 1 (above) shows a bare-bones decision tree learner for Boolean domains, using information gain and greedy search.20 InfoGain(xj, y) is the mutual information between feature xj and the class y. MakeNode(x, c0, c1) returns a node that tests feature x and has c0 as the child for x = 0 and c1 as the child for x = 1.
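For concreteness, here is a minimal sketch (not from the original pseudocode itself) of how Algorithm 1 might be written in Python, assuming the training set is a NumPy array of Boolean features plus a vector of class labels; the names entropy, info_gain, learn_dt, and predict are illustrative, and the learned tree is represented as nested tuples.

import numpy as np

def entropy(y):
    # Shannon entropy (in bits) of a vector of class labels.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def info_gain(xj, y):
    # Mutual information between one Boolean feature column and the class.
    gain = entropy(y)
    for v in (0, 1):
        mask = (xj == v)
        if mask.any():
            gain -= mask.mean() * entropy(y[mask])
    return gain

def learn_dt(X, y):
    # Bare-bones LearnDT: a leaf is a class label; an internal node is a
    # tuple (feature index, subtree for x = 0, subtree for x = 1).
    classes, counts = np.unique(y, return_counts=True)
    if len(classes) == 1:                      # all examples have the same class
        return classes[0]
    gains = [info_gain(X[:, j], y) for j in range(X.shape[1])]
    if max(gains) <= 0:                        # no feature has positive InfoGain
        return classes[np.argmax(counts)]      # most frequent class
    j = int(np.argmax(gains))                  # x* <- argmax_xj InfoGain(xj, y)
    mask = (X[:, j] == 0)
    return (j, learn_dt(X[mask], y[mask]), learn_dt(X[~mask], y[~mask]))

def predict(tree, x):
    # Follow the branches of the learned tree down to a leaf.
    while isinstance(tree, tuple):
        j, c0, c1 = tree
        tree = c0 if x[j] == 0 else c1
    return tree

On a toy training set where the class simply copies the first feature, this learner returns a single node testing that feature:

X = np.array([[0, 1], [1, 1], [1, 0], [0, 0]])
y = np.array([0, 1, 1, 0])
tree = learn_dt(X, y)                      # tests feature 0
print(predict(tree, np.array([1, 0])))     # prints 1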
Of course, not all combinations of one component from each column of the table make equal sense. For example, discrete representations naturally go with combinatorial optimization, and continuous ones with continuous optimization. Nevertheless, many learners have both discrete and continuous components, and in fact the day may not be far when every single possible combination has appeared in some learner!
Most textbooks are organized by representation, and it is easy to overlook the fact that the other components are equally important. There is no simple recipe for choosing each component, but I will touch on some of the key issues here. As we will see, some choices in a machine learning project may be even more important than the choice of learner.
It’s Generalization that Counts
The fundamental goal of machine learning is to generalize beyond the examples in the training set. This is because, no matter how much data we have, it is very unlikely that we will see those exact examples again at test time. (Notice that, if there are 100,000 words in the dictionary, the spam filter described above has 2^100,000 possible different inputs.) Doing well on the training set is easy (just memorize the examples). The most common mistake among machine learning beginners is to test on the training data and have the illusion of success. If the chosen classifier is then tested on new data, it is often no better than random guessing. So, if you hire someone to build a classifier, be sure to keep some of the data to yourself and test the classifier they give you on it. Conversely, if you have been hired to build a classifier, set some of the data aside from the beginning, and only use it to test your chosen classifier at the very end, followed by learning your final classifier on the whole data.
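As an illustration only (a sketch using scikit-learn and synthetic data, neither of which the article prescribes), the discipline looks like this in code: split off a test set at the start, never fit or tune on it, and read off its accuracy once at the end.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic data stands in for whatever labeled examples you actually have.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Set part of the data aside at the very beginning and do not touch it again.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("training accuracy:", accuracy_score(y_train, clf.predict(X_train)))  # optimistic: the tree can memorize
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))    # the estimate that actually matters

# Only after that final measurement is it safe to relearn on all of the data.
final_clf = DecisionTreeClassifier(random_state=0).fit(X, y)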
Contamination of your classifier by test data can occur in insidious ways, for example, if you use test data to tune parameters and do a lot of tuning. (Machine learning algorithms have lots of knobs, and success often comes from twiddling them a lot, so this is a real concern.) Of course, holding out data reduces the amount available for training. This can be mitigated by doing cross-validation: randomly dividing your training data into (say) 10 subsets, holding out each one while training on the rest, testing each learned classifier on the examples it did not see, and averaging the results to see how well the particular parameter setting does.
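Again as a sketch (using scikit-learn, with a decision tree's maximum depth standing in for the "knob" being tuned), 10-fold cross-validation over the training data only might look like this:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# X, y stand in for the training portion of the data; the final test set stays untouched.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

def cv_accuracy(max_depth, n_splits=10):
    # Average validation accuracy of one parameter setting over the 10 folds.
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, val_idx in folds.split(X):
        clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[val_idx], y[val_idx]))
    return float(np.mean(scores))

for depth in (2, 5, 10, None):
    print("max_depth =", depth, "cross-validated accuracy =", cv_accuracy(depth))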
In the early days of machine learning, the need to keep training and test data separate was not widely appreciated. This was partly because, if the learner has a very limited representation (for example, hyperplanes), the difference between training and test error may not be large. But with very flexible classifiers (for example, decision trees), or even with linear classifiers with a lot of features, strict separation is mandatory.
Notice that generalization being
the goal has an interesting consequence for machine learning. Unlike
in most other optimization problems,
we do not have access to the function
we want to optimize! We have to use
training error as a surrogate for test
error, and this is fraught with danger. (How to deal with it is addressed
later.) On the positive side, since the
objective function is only a proxy for
the true goal, we may not need to fully