DOI: 10.1145/3065386
ImageNet Classification with Deep
Convolutional Neural Networks
By Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton
Abstract
We trained a large, deep convolutional neural network to
classify the 1.2 million high-resolution images in the
ImageNet LSVRC-2010 contest into the 1000 different
classes. On the test data, we achieved top-1 and top-5 error
rates of 37.5% and 17.0%, respectively, which is considerably
better than the previous state-of-the-art. The neural network,
which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed
by max-pooling layers, and three fully connected layers with a
final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation
of the convolution operation. To reduce overfitting in the
fully connected layers we employed a recently developed regularization method called “dropout” that proved to be very
effective. We also entered a variant of this model in the
ILSVRC-2012 competition and achieved a winning top-5 test
error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
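For readers who want to see the shape of the network in code, the following is a minimal PyTorch sketch of the architecture the abstract describes: five convolutional layers with interleaved max-pooling, non-saturating (ReLU) nonlinearities, dropout in the fully connected layers, and a final 1000-way classifier. Layer sizes follow the published configuration, but the original's two-GPU split and local response normalization are omitted, and the class name is illustrative; treat this as a sketch, not the authors' implementation.

    # A minimal PyTorch sketch of the network described in the abstract.
    # Sizes follow the published configuration; the two-GPU split and
    # local response normalization of the original are omitted here.
    import torch
    import torch.nn as nn

    class AlexNetSketch(nn.Module):          # hypothetical name
        def __init__(self, num_classes: int = 1000):
            super().__init__()
            # Five convolutional layers, some followed by max-pooling.
            self.features = nn.Sequential(
                nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(kernel_size=3, stride=2),
            )
            # Three fully connected layers, with dropout for regularization.
            self.classifier = nn.Sequential(
                nn.Dropout(p=0.5),
                nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
                nn.Dropout(p=0.5),
                nn.Linear(4096, 4096), nn.ReLU(),
                nn.Linear(4096, num_classes),  # logits; softmax applied in the loss
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.features(x)               # expects 3 x 227 x 227 input
            x = torch.flatten(x, 1)
            return self.classifier(x)

    # logits = AlexNetSketch()(torch.randn(1, 3, 227, 227))  # shape (1, 1000)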
1. PROLOGUE
Four years ago, a paper by Yann LeCun and his collaborators
was rejected by the leading computer vision conference on
the grounds that it used neural networks and therefore provided no insight into how to design a vision system. At the
time, most computer vision researchers believed that a vision
system needed to be carefully hand-designed using a detailed
understanding of the nature of the task. They assumed that
the task of classifying objects in natural images would never
be solved by simply presenting examples of images and the
names of the objects they contained to a neural network that
acquired all of its knowledge from this training data.
What many in the vision research community failed to
appreciate was that methods that require careful hand-engineering by a programmer who understands the domain do
not scale as well as methods that replace the programmer
with a powerful general-purpose learning procedure. With
enough computation and enough data, learning beats programming for complicated tasks that require the integration
of many different, noisy cues.
Four years ago, while we were at the University of Toronto,
our deep neural network called SuperVision almost halved
the error rate for recognizing objects in natural images and
triggered an overdue paradigm shift in computer vision.
Figure 4 shows some examples of what SuperVision can do.
SuperVision evolved from the multilayer neural networks
that were widely investigated in the 1980s. These networks
used multiple layers of feature detectors that were all learned
from the training data. Neuroscientists and psychologists had
hypothesized that a hierarchy of such feature detectors would
provide a robust way to recognize objects but they had no idea
how such a hierarchy could be learned. There was great excitement in the 1980s because several different research groups discovered that multiple layers of feature detectors could be trained efficiently using a relatively straightforward algorithm called backpropagation [18, 22, 27, 33] to compute, for each image,
how the classification performance of the whole network
depended on the value of the weight on each connection.
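To make the procedure concrete, here is a minimal NumPy sketch of backpropagation for a toy two-layer network. The sigmoid units, squared error loss, layer sizes, and all variable names are illustrative assumptions, not the historical setups cited above; the backward pass applies the chain rule to obtain, for every connection weight, the derivative of the loss with respect to that weight.

    # A minimal NumPy sketch of backpropagation for a two-layer network,
    # assuming sigmoid hidden units and a squared error loss purely for
    # illustration.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4,))                 # input vector
    t = np.array([1.0, 0.0])                  # target output
    W1 = rng.normal(scale=0.1, size=(3, 4))   # input -> hidden weights
    W2 = rng.normal(scale=0.1, size=(2, 3))   # hidden -> output weights

    # Forward pass: compute activations layer by layer.
    h = sigmoid(W1 @ x)                       # hidden activations
    y = sigmoid(W2 @ h)                       # network output
    loss = 0.5 * np.sum((y - t) ** 2)

    # Backward pass: the chain rule gives the derivative of the loss
    # with respect to the weight on every connection.
    delta_y = (y - t) * y * (1 - y)           # error signal at the output
    delta_h = (W2.T @ delta_y) * h * (1 - h)  # error propagated to hidden layer
    grad_W2 = np.outer(delta_y, h)            # dLoss/dW2
    grad_W1 = np.outer(delta_h, x)            # dLoss/dW1

    # One step of gradient descent on both weight matrices.
    lr = 0.1
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1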
Backpropagation worked well for a variety of tasks, but in
the 1980s it did not live up to the very high expectations of its
advocates. In particular, it proved to be very difficult to learn
networks with many layers, and these were precisely the networks that should have given the most impressive results.
Many researchers concluded, incorrectly, that learning a
deep neural network from random initial weights was just too
difficult. Twenty years later, we know what went wrong: for
deep neural networks to shine, they needed far more labeled
data and hugely more computation.
2. INTRODUCTION
Current approaches to object recognition make essential use of machine learning methods. To improve their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting. Until recently, datasets of labeled images were relatively small, on the order of tens of thousands of images (e.g., NORB [19], Caltech-101/256 [8, 10], and CIFAR-10/100 [14]). Simple recognition tasks can be solved quite well with datasets of this size, especially if they are augmented with label-preserving transformations. For example, the current-best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [5]. But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets. And indeed, the shortcomings of small image datasets have been widely recognized (e.g., Ref. [25]), but it has only recently become possible to collect labeled datasets with millions of images.
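As an aside on the label-preserving transformations mentioned above, the following is a small illustrative sketch: random cropping and horizontal reflection of a NumPy image array, neither of which changes the class label. The crop size and flip probability are assumptions for the example, not the paper's exact augmentation scheme.

    # A minimal sketch of label-preserving augmentation (random crop and
    # horizontal flip); parameter choices here are illustrative assumptions.
    import numpy as np

    def augment(image: np.ndarray, crop: int, rng: np.random.Generator) -> np.ndarray:
        """Return a randomly cropped, possibly mirrored copy of `image`.

        `image` has shape (H, W, C); the class label is unchanged by
        either transformation, which is what makes them label-preserving.
        """
        h, w, _ = image.shape
        top = rng.integers(0, h - crop + 1)
        left = rng.integers(0, w - crop + 1)
        patch = image[top:top + crop, left:left + crop]
        if rng.random() < 0.5:                # mirror half the time
            patch = patch[:, ::-1]
        return patch

    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
    sample = augment(img, crop=224, rng=rng)  # one training view of img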
The original version of this paper was published in the Proceedings of the 25th International Conference on Neural Information Processing Systems (Lake Tahoe, NV, Dec. 2012), 1097–1105.