A Better Way to Learn Features
By Geoffrey E. Hinton
A typical machine learning program uses weighted combinations of features to discriminate between classes or to predict real-valued outcomes. The art of machine learning is in constructing the features, and a radically new method of creating features constitutes a major advance.
In the 1980s, the new method was
backpropagation, which uses the chain
rule to backpropagate error derivatives
through a multilayer, feed-forward neural network and adjusts the weights
between layers by following the gradient of the backpropagated error. This
worked well for recognizing simple
shapes, such as handwritten digits,
especially in convolutional neural networks that use local feature detectors
replicated across the image [5]. For many
tasks, however, it proved extremely difficult to optimize deep neural nets with
many layers of non-linear features,
and a huge number of labeled training
cases was required for large neural networks to generalize well to test data.
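To make the mechanism concrete, the following sketch trains a tiny feed-forward network with one layer of non-linear features by backpropagating error derivatives with the chain rule and following the gradient. The data, layer sizes, and learning rate are arbitrary illustrative choices, not details of any system discussed here.

# Minimal backpropagation sketch; everything here is illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 random inputs, binary labels from an arbitrary linear rule.
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

# One hidden layer of non-linear features and one logistic output unit.
W1 = rng.normal(scale=0.1, size=(4, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=(8, 1))
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for epoch in range(2000):
    # Forward pass: compute feature activations and the prediction.
    h = sigmoid(X @ W1 + b1)               # hidden feature activations
    p = sigmoid(h @ W2 + b2)               # predicted class probability

    # Backward pass: the chain rule carries error derivatives from the
    # output back through the hidden layer.
    d_out = (p - y) / len(X)               # derivative at the output pre-activation
    d_hid = (d_out @ W2.T) * h * (1 - h)   # derivative at the hidden pre-activation

    # Adjust the weights between layers by following the gradient.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid
    b1 -= lr * d_hid.sum(axis=0)

print("training accuracy:", ((p > 0.5) == y).mean())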
In the 1990s, Support Vector Machines (SVMs) [8] introduced a very different way of creating features: the user
defines a kernel function that computes the similarity between two input
vectors, then a judiciously chosen subset of the training examples is used to
create “landmark” features that measure how similar a test case is to each
training case. SVMs have a clever way
of choosing which training cases to
use as landmarks and deciding how
to weight them. They work remarkably well on many machine learning tasks even though the selected features are not learned but are fixed by the kernel function and the chosen training cases.
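The form of such a classifier is easy to sketch: each landmark feature measures the kernel similarity of an input to one training case, and the decision is a weighted sum of those similarities. In the sketch below, a plain ridge-regularized least-squares fit stands in for the SVM's actual choice of landmarks and weights, so it only shows the shape of the resulting classifier; the data, kernel width, and regularization constant are assumptions made for illustration.

# Kernel "landmark" features: similarity of an input to each training case.
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 2))
y_train = np.where(X_train[:, 0] * X_train[:, 1] > 0, 1.0, -1.0)  # XOR-like labels

def rbf(a, b, gamma=1.0):
    # Gaussian similarity between every row of a and every row of b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Landmark features for the training set itself.
K = rbf(X_train, X_train)

# Stand-in for the SVM's weighting of landmarks: a ridge-regularized fit.
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(K)), y_train)

def decision(x):
    # Weighted sum of similarities to the training-case landmarks.
    return rbf(x, X_train) @ alpha

X_test = rng.normal(size=(200, 2))
y_test = np.where(X_test[:, 0] * X_test[:, 1] > 0, 1.0, -1.0)
print("test accuracy with landmark features:",
      (np.sign(decision(X_test)) == y_test).mean())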
The success of SVMs dampened
the earlier enthusiasm for neural
networks. More recently, however, it
has been shown that multiple layers
of feature detectors can be learned
greedily, one layer at a time, by using
unsupervised learning that does not
require labeled data. The features in
each layer are designed to model the
statistical structure of the patterns
of feature activations in the previous
layer. After learning several layers of
features this way without paying any
attention to the final goal, many of the
high-level features will be irrelevant
for any particular task, but others will
be highly relevant because high-order
correlations are the signature of the
data’s true underlying causes and
the labels are more directly related to
these causes than to the raw inputs. A
subsequent stage of fine-tuning using
backpropagation then yields neural
networks that work much better than
those trained by backpropagation
alone and better than SVMs for important tasks such as object or speech recognition [1, 2, 4]. The neural networks
outperform SVMs because the limited
amount of information in the labels
is not being used to create multiple
features from scratch; it is only being
used to adjust the class boundaries by
slightly modifying the features.
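The two-stage recipe can be sketched as follows: a greedy unsupervised stage learns each layer of features from the activations of the layer below it, and a supervised fine-tuning stage then uses the labels only to add an output unit and slightly adjust the pretrained features by backpropagation. The cited work mostly used restricted Boltzmann machines as the unsupervised learner; the sketch below substitutes tiny tied-weight autoencoders to keep the code short, and every size, learning rate, and the synthetic data set are illustrative assumptions.

# Greedy layer-wise pretraining followed by supervised fine-tuning (sketch).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 16))                                 # toy inputs
y = (X[:, :8].sum(axis=1) > 0).astype(float).reshape(-1, 1)    # toy labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(data, n_hidden, lr=0.1, epochs=300):
    # Unsupervised stage: learn features that reconstruct the pattern of
    # activity in the layer below, ignoring the labels entirely.
    n_in = data.shape[1]
    W = rng.normal(scale=0.1, size=(n_in, n_hidden))
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        h = sigmoid(data @ W + b_h)        # encode: feature activations
        recon = h @ W.T + b_v              # decode with tied weights
        err = (recon - data) / len(data)
        d_pre = (err @ W) * h * (1 - h)
        W -= lr * (err.T @ h + data.T @ d_pre)
        b_v -= lr * err.sum(axis=0)
        b_h -= lr * d_pre.sum(axis=0)
    return [W, b_h]

# Greedy stage: each layer models the feature activations of the layer below.
layers, acts = [], X
for n_hidden in (12, 8):
    layers.append(pretrain_layer(acts, n_hidden))
    W, b = layers[-1]
    acts = sigmoid(acts @ W + b)

# Fine-tuning stage: the labels are used only to add an output unit and to
# slightly adjust the pretrained features by backpropagation.
W_out = rng.normal(scale=0.1, size=(acts.shape[1], 1))
b_out = np.zeros(1)
lr = 0.5
for _ in range(1000):
    hs = [X]
    for W, b in layers:                        # forward pass through all layers
        hs.append(sigmoid(hs[-1] @ W + b))
    p = sigmoid(hs[-1] @ W_out + b_out)        # predicted class probability
    d = (p - y) / len(X)                       # output-layer error derivative
    d_back = (d @ W_out.T) * hs[-1] * (1 - hs[-1])
    W_out -= lr * hs[-1].T @ d
    b_out -= lr * d.sum(axis=0)
    for i in range(len(layers) - 1, -1, -1):   # backpropagate through the stack
        W, b = layers[i]
        gW, gb = hs[i].T @ d_back, d_back.sum(axis=0)
        if i > 0:
            d_back = (d_back @ W.T) * hs[i] * (1 - hs[i])
        layers[i][0] = W - lr * gW
        layers[i][1] = b - lr * gb

print("training accuracy after fine-tuning:", ((p > 0.5) == y).mean())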
1. Bengio, Y., Lamblin, P., Popovici, D. and Larochelle, H. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems 19. B. Schölkopf, J. Platt, and T. Hoffman, Eds. MIT Press, Cambridge, MA, 2007.
2. Dahl, G., Mohamed, A. and Hinton, G.E. Acoustic modeling using deep belief networks. IEEE Trans. on Audio, Speech, and Language Processing 19, 8 (2011).
3. Hinton, G.E., Osindero, S. and Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Computation 18, 7 (2006), 1527–1554.
4. Hinton, G.E. and Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 313 (2006), 504–507.
5. LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.
6. Lowe, D.G. Object recognition from local scale-invariant features. In Proc. International Conference on Computer Vision, 1999.
7. Salakhutdinov, R. Learning Deep Generative Models. PhD thesis, University of Toronto, 2009.
8. Vapnik, V.N. The Nature of Statistical Learning Theory. Springer, New York, NY, 2000.
Geoffrey E. Hinton (email@example.com) is a professor of computer science at the University of Toronto, Canada.