theorems of the form “Every function
can be represented, or approximated
arbitrarily closely, using this representation.” Reassured by this, fans of
the representation often proceed to
ignore all others. However, just because a function can be represented
does not mean it can be learned. For
example, standard decision tree learners cannot learn trees with more leaves
than there are training examples. In
continuous spaces, representing even
simple functions using a fixed set of
primitives often requires an infinite
number of components. Further, if
the hypothesis space has many local
optima of the evaluation function, as
is often the case, the learner may not
find the true function even if it is representable. Given finite data, time and
memory, standard learners can learn
only a tiny subset of all possible functions, and these subsets are different
for learners with different representations. Therefore the key question is
not “Can it be represented?” to which
the answer is often trivial, but “Can it
be learned?” And it pays to try different
learners (and possibly combine them).
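As a rough illustration of the learnability point, here is a minimal sketch in Python. It assumes scikit-learn and NumPy are available (neither toolkit is named in this article; they stand in for any standard decision tree learner): a tree grown on N training examples can never have more than N leaves, however complex the target function.

    # Minimal sketch (assumes scikit-learn and NumPy are installed): a decision
    # tree fit to 100 examples cannot have more than 100 leaves, no matter how
    # complex the true function is, because each leaf must cover at least one
    # training example.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))        # 100 training examples, 10 features
    y = rng.integers(0, 2, size=100)      # arbitrary binary labels

    tree = DecisionTreeClassifier().fit(X, y)   # no depth limit imposed
    print(tree.get_n_leaves())                  # always <= 100, the number of examples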
Some representations are exponentially more compact than others for
some functions. As a result, they may
also require exponentially less data to
learn those functions. Many learners
work by forming linear combinations
of simple basis functions. For example, support vector machines form
combinations of kernels centered at
some of the training examples (the
support vectors). Representing parity
of n bits in this way requires 2^n basis
functions. But using a representation
with more layers (that is, more steps
between input and output), parity can
be encoded in a linear-size classifier.
Finding methods to learn these deeper
representations is one of the major research frontiers in machine learning.[2]
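To make the parity example concrete, here is a minimal sketch in plain Python. It is not a learner, only an illustration of the size gap between the two representations; the explicit lookup table stands in for a flat expansion with one fixed basis function per input pattern (the analogy, not the code, is the point).

    # Minimal sketch of the representation point (plain Python, no learning):
    # a flat expansion with one fixed indicator per input pattern needs 2^n
    # components to express n-bit parity, while a layered representation --
    # a chain of two-input XOR steps -- grows only linearly in n.
    import itertools
    from functools import reduce

    def parity_deep(bits):
        # "Deep" route: n - 1 two-input XOR steps, linear in n.
        return reduce(lambda acc, b: acc ^ b, bits, 0)

    def parity_shallow_table(n):
        # "Shallow" route: an explicit entry per input pattern, 2^n in total.
        return {p: sum(p) % 2 for p in itertools.product((0, 1), repeat=n)}

    bits = (1, 0, 1, 1)
    table = parity_shallow_table(len(bits))
    print(parity_deep(bits), table[bits], len(table))   # 1 1 16 (16 = 2^4 entries)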
Correlation Does Not Imply Causation
The point that correlation does not
imply causation is made so often that
it is perhaps not worth belaboring.
But, even though learners of the kind
we have been discussing can only
learn correlations, their results are
often treated as representing causal
relations. Isn’t this wrong? If so, then
why do people do it?
More often than not, the goal
of learning predictive models is to
use them as guides to action. If we
find that beer and diapers are often
bought together at the supermarket, then perhaps putting beer next
to the diaper section will increase
sales. (This is a famous example in
the world of data mining.) But short
of actually doing the experiment it is
difficult to tell. Machine learning is
usually applied to observational data,
where the predictive variables are not
under the control of the learner, as
opposed to experimental data, where
they are. Some learning algorithms
can potentially extract causal information from observational data, but
their applicability is rather restricted.[19] On the other hand, correlation
is a sign of a potential causal connection, and we can use it as a guide to
further investigation (for example,
trying to understand what the causal
chain might be).
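The beer-and-diapers association can be made concrete with a minimal sketch over invented transaction data (the numbers below are made up for illustration, and the measures computed here, the standard association-rule quantities of support, confidence, and lift, are not named in the article). A high lift only says the items co-occur more often than chance would predict; it does not say whether moving the beer would change sales.

    # Minimal sketch with invented transaction data: measuring the beer/diapers
    # association. High lift indicates co-occurrence beyond chance -- a potential
    # causal connection worth investigating, not a causal conclusion.
    transactions = [
        {"beer", "diapers", "chips"},
        {"beer", "diapers"},
        {"diapers", "milk"},
        {"beer", "chips"},
        {"beer", "diapers", "milk"},
        {"milk", "bread"},
    ]

    n = len(transactions)
    p_beer = sum("beer" in t for t in transactions) / n
    p_diapers = sum("diapers" in t for t in transactions) / n
    p_both = sum({"beer", "diapers"} <= t for t in transactions) / n

    support = p_both                        # P(beer and diapers)
    confidence = p_both / p_diapers         # P(beer | diapers)
    lift = p_both / (p_beer * p_diapers)    # co-occurrence relative to independence
    print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")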
Many researchers believe that causality is only a convenient fiction. For
example, there is no notion of causality in physical laws. Whether or not
causality really exists is a deep philosophical question with no definitive
answer in sight, but there are two
practical points for machine learners. First, whether or not we call them
“causal,” we would like to predict the
effects of our actions, not just correlations between observable variables.
Second, if you can obtain experimental data (for example by randomly assigning visitors to different versions of
a Web site), then by all means do so.[14]
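A minimal sketch of the experimental route, with simulated visitors and invented conversion rates (both are assumptions made purely for illustration): random assignment severs any link between who sees which version and everything else about them, so a difference in outcomes can be read as the effect of the version, up to sampling noise.

    # Minimal sketch of a randomized experiment (A/B test) on simulated data.
    # The "true" conversion rates are invented and unknown to the experimenter.
    import random

    random.seed(0)

    def visit(version):
        rate = 0.10 if version == "A" else 0.12   # hypothetical true rates
        return 1 if random.random() < rate else 0

    results = {"A": [], "B": []}
    for _ in range(10000):
        version = random.choice(["A", "B"])       # the randomization step
        results[version].append(visit(version))

    for v in ("A", "B"):
        print(v, sum(results[v]) / len(results[v]))   # estimated conversion rates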
Conclusion
Like any discipline, machine learning has a lot of “folk wisdom” that can
be difficult to come by, but is crucial
for success. This article summarized
some of the most salient items. Of
course, it is only a complement to the
more conventional study of machine
learning. Check out http://www.
cs.washington.edu/homes/pedrod/
class for a complete online machine
learning course that combines formal
and informal aspects. There is also a
treasure trove of machine learning
lectures at http://www.videolectures.
net. A good open source machine
learning toolkit is Weka.[24]
Happy learning!
References
1. Bauer, E. and Kohavi, R. An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning 36 (1999), 105–142.
2. Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1 (2009), 1–127.
3. Benjamini, Y. and Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57 (1995), 289–300.
4. Bernardo, J.M. and Smith, A.F.M. Bayesian Theory. Wiley, NY, 1994.
5. Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M.K. Occam's razor. Information Processing Letters 24 (1987), 377–380.
6. Cohen, W.W. Grammatically biased learning: Learning logic programs using an explicit antecedent description language. Artificial Intelligence 68 (1994), 303–366.
7. Domingos, P. The role of Occam's razor in knowledge discovery. Data Mining and Knowledge Discovery 3 (1999), 409–425.
8. Domingos, P. Bayesian averaging of classifiers and the overfitting problem. In Proceedings of the 17th International Conference on Machine Learning (Stanford, CA, 2000), Morgan Kaufmann, San Mateo, CA, 223–230.
9. Domingos, P. A unified bias-variance decomposition and its applications. In Proceedings of the 17th International Conference on Machine Learning (Stanford, CA, 2000), Morgan Kaufmann, San Mateo, CA, 231–238.
10. Domingos, P. and Pazzani, M. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29 (1997), 103–130.
11. Hulten, G. and Domingos, P. Mining complex models from arbitrarily large databases in constant time. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Edmonton, Canada, 2002), ACM Press, NY, 525–531.
12. Kibler, D. and Langley, P. Machine learning as an experimental science. In Proceedings of the 3rd European Working Session on Learning (London, UK, 1988), Pitman.
13. Klockars, A.J. and Sax, G. Multiple Comparisons. Sage, Beverly Hills, CA, 1986.
14. Kohavi, R., Longbotham, R., Sommerfield, D. and Henne, R. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18 (2009), 140–181.
15. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. and Byers, A. Big data: The next frontier for innovation, competition, and productivity. Technical report, McKinsey Global Institute, 2011.
16. Mitchell, T.M. Machine Learning. McGraw-Hill, NY, 1997.
17. Ng, A.Y. Preventing "overfitting" of cross-validation data. In Proceedings of the 14th International Conference on Machine Learning (Nashville, TN, 1997), Morgan Kaufmann, San Mateo, CA, 245–253.
18. Pearl, J. On the connection between the complexity and credibility of inferred models. International Journal of General Systems 4 (1978), 255–264.
19. Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, UK, 2000.
20. Quinlan, J.R. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
21. Richardson, M. and Domingos, P. Markov logic networks. Machine Learning 62 (2006), 107–136.
22. Tenenbaum, J., Silva, V. and Langford, J. A global geometric framework for nonlinear dimensionality reduction. Science 290 (2000), 2319–2323.
23. Vapnik, V.N. The Nature of Statistical Learning Theory. Springer, NY, 1995.
24. Witten, I., Frank, E. and Hall, M. Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition. Morgan Kaufmann, San Mateo, CA, 2011.
25. Wolpert, D. The lack of a priori distinctions between learning algorithms. Neural Computation 8 (1996), 1341–1390.
Pedro Domingos (pedrod@cs.washington.edu) is a professor in the Department of Computer Science and Engineering at the University of Washington, Seattle.
© 2012 ACM 0001-0782/12/10 $15.00