images. The new larger datasets include LabelMe, which consists of hundreds of thousands of fully segmented images, and ImageNet,[7] which consists of over 15 million labeled high-resolution images in over 22,000 categories.
To learn about thousands of objects from millions of
images, we need a model with a large learning capacity.
However, the immense complexity of the object recognition task means that this problem cannot be specified even
by a dataset as large as ImageNet, so our model should also
have lots of prior knowledge to compensate for all the data
we do not have. Convolutional neural networks (CNNs) constitute one such class of models.[9, 15, 17, 19, 21, 26, 32] Their capacity
can be controlled by varying their depth and breadth, and
they also make strong and mostly correct assumptions
about the nature of images (namely, stationarity of statistics and locality of pixel dependencies). Thus, compared to
standard feedforward neural networks with similarly sized
layers, CNNs have far fewer connections and parameters
and so they are easier to train, while their theoretically best
performance is likely to be only slightly worse.
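As a toy illustration of this parameter saving (the layer sizes below are chosen for illustration and are not taken from the paper), compare a fully connected layer with a convolutional layer mapping between feature maps of the same size:

```python
# Hypothetical comparison: one layer mapping a 32x32x64 feature map
# to an output of the same size, fully connected vs. convolutional.

fan_in = 32 * 32 * 64          # number of input activations
fan_out = 32 * 32 * 64         # number of output activations

# Fully connected: every output unit sees every input unit.
fc_params = fan_in * fan_out   # ignoring biases

# Convolutional: each output unit sees only a local 3x3 window, and the
# same 3x3x64 filter weights are shared across all spatial positions.
k = 3
conv_params = k * k * 64 * 64  # 64 filters, each of size 3x3x64

print(fc_params)    # 4,294,967,296 weights
print(conv_params)  # 36,864 weights
```

The locality and weight-sharing assumptions cut the weight count by five orders of magnitude in this sketch, which is what makes such layers tractable to train.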
Despite the attractive qualities of CNNs, and despite the
relative efficiency of their local architecture, they have still
been prohibitively expensive to apply at large scale to high-resolution images. Luckily, current GPUs, paired with a
highly optimized implementation of 2D convolution, are
powerful enough to facilitate the training of interestingly large CNNs, and recent datasets such as ImageNet contain
enough labeled examples to train such models without severe overfitting.
The specific contributions of this paper are as follows: we
trained one of the largest CNNs to date on the subsets of
ImageNet used in the ImageNet Large-Scale Visual
Recognition Challenge (ILSVRC)-2010 and ILSVRC-2012
competitions[2] and achieved by far the best results ever
reported on these datasets. We wrote a highly optimized GPU
implementation of 2D convolution and all the other operations inherent in training CNNs, which we make available
publicly (http://code.google.com/p/cuda-convnet/). Our network contains a number of new and
unusual features which improve its performance and reduce
its training time, which are detailed in Section 4. The size of
our network made overfitting a significant problem, even
with 1.2 million labeled training examples, so we used several
effective techniques for preventing overfitting, which are
described in Section 5. Our final network contains five convolutional and three fully connected layers, and this depth
seems to be important: we found that removing any convolutional layer (each of which contains no more than 1% of the
model’s parameters) resulted in inferior performance.
In the end, the network’s size is limited mainly by the
amount of memory available on current GPUs and by the
amount of training time that we are willing to tolerate. Our
network takes between 5 and 6 days to train on two GTX 580
3GB GPUs. All of our experiments suggest that our results can
be improved simply by waiting for faster GPUs and bigger
datasets to become available.
3. THE DATASET
ImageNet is a dataset of over 15 million labeled high-resolution
images belonging to roughly 22,000 categories. The images
were collected from the web and labeled by human labelers
using Amazon’s Mechanical Turk crowd-sourcing tool. Starting
in 2010, as part of the Pascal Visual Object Challenge, an annual
competition called the ImageNet Large-Scale Visual
Recognition Challenge (ILSVRC) has been held. ILSVRC uses a
subset of ImageNet with roughly 1000 images in each of 1000
categories. In all, there are roughly 1.2 million training images,
50,000 validation images, and 150,000 testing images.
ILSVRC-2010 is the only version of ILSVRC for which the
test set labels are available, so this is the version on which we
performed most of our experiments. Since we also entered
our model in the ILSVRC-2012 competition, in Section 7 we
report our results on this version of the dataset as well, for
which test set labels are unavailable. On ImageNet, it is customary to report two error rates: top-1 and top-5, where the
top-5 error rate is the fraction of test images for which the
correct label is not among the five labels considered most
probable by the model.
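The top-k error can be sketched in a few lines (the helper name and toy scores below are our own, not from the paper):

```python
import numpy as np

def topk_error(scores, labels, k):
    """Fraction of examples whose true label is not among the
    k highest-scoring classes."""
    # indices of the k largest scores in each row
    topk = np.argsort(scores, axis=1)[:, -k:]
    hits = np.any(topk == labels[:, None], axis=1)
    return 1.0 - hits.mean()

# toy example: 3 images, 6 classes
scores = np.array([
    [0.1, 0.9, 0.3, 0.2, 0.0, 0.1],  # true class 1 ranks 1st
    [0.5, 0.1, 0.4, 0.3, 0.2, 0.0],  # true class 5 ranks last
    [0.2, 0.3, 0.1, 0.6, 0.5, 0.4],  # true class 0 ranks 5th
])
labels = np.array([1, 5, 0])
print(topk_error(scores, labels, 1))  # 2 of 3 images miss at top-1
print(topk_error(scores, labels, 5))  # only the 2nd image misses at top-5
```

Top-5 error is always at most top-1 error, since any top-1 hit is also a top-5 hit.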
ImageNet consists of variable-resolution images, while
our system requires a constant input dimensionality.
Therefore, we down-sampled the images to a fixed resolution
of 256 × 256. Given a rectangular image, we first rescaled the
image such that the shorter side was of length 256, and then
cropped out the central 256 × 256 patch from the resulting
image. We did not preprocess the images in any other way,
except for subtracting the mean activity over the training set
from each pixel. So we trained our network on the (centered)
raw RGB values of the pixels.
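The rescale-then-crop pipeline can be sketched as follows. The paper does not specify an implementation, so the nearest-neighbor resize and the helper name here are illustrative choices only:

```python
import numpy as np

def preprocess(img, side=256):
    """Rescale the shorter side to `side` (nearest-neighbor, for
    brevity), then cut out the central side x side patch."""
    h, w, _ = img.shape
    scale = side / min(h, w)
    nh, nw = round(h * scale), round(w * scale)
    # nearest-neighbor resize via index sampling
    rows = (np.arange(nh) * h / nh).astype(int)
    cols = (np.arange(nw) * w / nw).astype(int)
    img = img[rows][:, cols]
    # central crop
    top, left = (nh - side) // 2, (nw - side) // 2
    return img[top:top + side, left:left + side]

# toy usage on random 300x500 "images"
rng = np.random.default_rng(0)
batch = np.stack([preprocess(rng.integers(0, 256, (300, 500, 3)))
                  for _ in range(4)]).astype(np.float32)
batch -= batch.mean(axis=0)  # subtract the per-pixel mean over the (toy) set
print(batch.shape)           # (4, 256, 256, 3)
```

In practice the mean image is computed once over the full training set and subtracted from every example, rather than per batch as in this toy sketch.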
4. THE ARCHITECTURE
The architecture of our network is summarized in Figure 2. It
contains eight learned layers—five convolutional and three
fully connected. Below, we describe some of the novel or
unusual features of our network’s architecture. Sections 4.1–4.4 are sorted according to our estimation of their importance, with the most important first.
4.1. Rectified Linear Unit nonlinearity
The standard way to model a neuron’s output f as a function
of its input x is with f(x) = tanh(x) or f(x) = (1 + e^(−x))^(−1). In terms
of training time with gradient descent, these saturating
nonlinearities are much slower than the non-saturating
nonlinearity f(x) = max(0, x). Following Nair and Hinton,
we refer to neurons with this nonlinearity as Rectified
Linear Units (ReLUs). Deep CNNs with ReLUs train several
times faster than their equivalents with tanh units. This is
demonstrated in Figure 1, which shows the number of iterations required to reach 25% training error on the CIFAR-10
dataset for a particular four-layer convolutional network.
This plot shows that we would not have been able to experiment with such large neural networks for this work if we
had used traditional saturating neuron models.
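The saturation effect is visible directly in the gradients of the three nonlinearities (a toy numerical check, not from the paper):

```python
import numpy as np

x = np.array([-6.0, -1.0, 0.5, 6.0])

# saturating nonlinearities: gradients vanish for large |x|
tanh_grad = 1 - np.tanh(x) ** 2
sig = 1 / (1 + np.exp(-x))
sig_grad = sig * (1 - sig)

# non-saturating ReLU: gradient is exactly 1 wherever the unit is active
relu_grad = (x > 0).astype(float)

print(tanh_grad)  # near zero at x = ±6, so learning there stalls
print(sig_grad)   # likewise near zero at x = ±6
print(relu_grad)  # stays at 1 for any positive input
```

Because the ReLU gradient does not shrink for large positive inputs, gradient descent keeps making full-sized updates, which is consistent with the faster training reported in Figure 1.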
We are not the first to consider alternatives to traditional
neuron models in CNNs. For example, Jarrett et al. claimed
that the nonlinearity f(x) = |tanh(x)| works particularly well
with their type of contrast normalization followed by local