The second form of data augmentation consists of altering
the intensities of the RGB channels in training images.
Specifically, we perform PCA on the set of RGB pixel values
throughout the ImageNet training set. To each training
image, we add multiples of the found principal components,
with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean 0 and standard deviation 0.1. Therefore to each RGB image pixel $I_{xy} = [I^R_{xy}, I^G_{xy}, I^B_{xy}]^T$ we add the following quantity:

$$[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3]\,[\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T,$$
where $\mathbf{p}_i$ and $\lambda_i$ are the $i$th eigenvector and eigenvalue of the 3 × 3 covariance matrix of RGB pixel values, respectively, and $\alpha_i$ is the aforementioned random variable. Each $\alpha_i$ is drawn only once for all the pixels of a particular training image until that image is used for training again, at which point it is redrawn.
This scheme approximately captures an important property
of natural images, namely, that object identity is invariant to
changes in the intensity and color of the illumination. This
scheme reduces the top-1 error rate by over 1%.
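As an illustration, here is a minimal NumPy sketch of this color augmentation; the function names and the per-image application are ours, not the authors' implementation:

```python
import numpy as np

def rgb_pca(pixels):
    """PCA over pooled RGB pixel values (pixels: N x 3 array from the training set)."""
    cov = np.cov(pixels, rowvar=False)        # 3 x 3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)    # columns of eigvecs are the p_i
    return eigvecs, eigvals

def pca_color_augment(image, eigvecs, eigvals, std=0.1, rng=np.random):
    """Add the PCA-based color offset described above to one RGB image (H x W x 3)."""
    alphas = rng.normal(0.0, std, size=3)     # one alpha_i per component, redrawn per presentation
    offset = eigvecs @ (alphas * eigvals)     # [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T
    return image + offset                     # the same offset is added to every pixel
```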
5.2. Dropout
Combining the predictions of many different models is a very
successful way to reduce test errors,1,3 but it appears to be too expensive for big neural networks that already take several days to train. There is, however, a very efficient version of model combination that only costs about a factor of two during training. The recently introduced technique, called "dropout,"12 consists of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are
“dropped out” in this way do not contribute to the forward
pass and do not participate in back-propagation. So every
time an input is presented, the neural network samples a different architecture, but all these architectures share weights.
This technique reduces complex co-adaptations of neurons,
since a neuron cannot rely on the presence of particular other
neurons. It is, therefore, forced to learn more robust features
that are useful in conjunction with many different random
subsets of the other neurons. At test time, we use all the neurons but multiply their outputs by 0.5, which is a reasonable
approximation to taking the geometric mean of the predictive distributions produced by the exponentially-many dropout networks.
We use dropout in the first two fully connected layers of
Figure 2. Without dropout, our network exhibits substantial
overfitting. Dropout roughly doubles the number of iterations required to converge.
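A minimal sketch of this behavior, with illustrative names: units are masked with probability 0.5 at training time and all outputs are scaled by 0.5 at test time, as described above.

```python
import numpy as np

def dropout(activations, p_drop=0.5, train=True, rng=np.random):
    """Training: zero each hidden unit's output with probability p_drop.
    Test: keep all units and scale outputs by (1 - p_drop) = 0.5, approximating
    the geometric mean of the sampled sub-networks' predictions."""
    if train:
        keep = rng.uniform(size=activations.shape) >= p_drop
        return activations * keep             # dropped units contribute nothing
    return activations * (1.0 - p_drop)
```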
6. DETAILS OF LEARNING
We trained our models using stochastic gradient descent
with a batch size of 128 examples, momentum of 0.9, and
weight decay of 0.0005. We found that this small amount of
weight decay was important for the model to learn. In other
words, weight decay here is not merely a regularizer: it
reduces the model’s training error. The update rule for weight $w$ was

$$v_{i+1} := 0.9 \cdot v_i \;-\; 0.0005 \cdot \varepsilon \cdot w_i \;-\; \varepsilon \cdot \left\langle \frac{\partial L}{\partial w}\Big|_{w_i} \right\rangle_{D_i}$$
$$w_{i+1} := w_i + v_{i+1},$$

where $i$ is the iteration index, $v$ is the momentum variable, $\varepsilon$ is the learning rate, and $\langle \frac{\partial L}{\partial w}|_{w_i} \rangle_{D_i}$ is the average over the $i$th batch $D_i$ of the derivative of the objective with respect to $w$, evaluated at $w_i$.
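In code, one step of this rule might look as follows (a sketch; the function name and calling convention are illustrative):

```python
def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One update of the rule above; grad is the batch-averaged gradient of the
    objective with respect to w, evaluated at the current w."""
    v = momentum * v - weight_decay * lr * w - lr * grad   # v_{i+1}
    w = w + v                                              # w_{i+1}
    return w, v
```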
We initialized the weights in each layer from a zero-mean
Gaussian distribution with standard deviation 0.01. We initialized the neuron biases in the second, fourth, and fifth
convolutional layers, as well as in the fully connected hidden
layers, with the constant 1. This initialization accelerates the
early stages of learning by providing the ReLUs with positive
inputs. We initialized the neuron biases in the remaining layers with the constant 0.
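A sketch of this initialization for a single fully connected layer (the helper name and signature are illustrative):

```python
import numpy as np

def init_dense(n_in, n_out, positive_bias, rng=np.random):
    """Weights ~ N(0, 0.01^2); biases set to 1 where the following ReLUs should
    start with positive inputs (conv layers 2, 4, 5 and the fully connected
    hidden layers), and to 0 elsewhere."""
    weights = rng.normal(0.0, 0.01, size=(n_in, n_out))
    biases = np.ones(n_out) if positive_bias else np.zeros(n_out)
    return weights, biases
```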
We used an equal learning rate for all layers, which we
adjusted manually throughout training. The heuristic which
we followed was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning
rate. The learning rate was initialized at 0.01 and reduced
three times prior to termination. We trained the network for
roughly 90 cycles through the training set of 1.2 million images, which took 5–6 days on two NVIDIA GTX 580 3GB GPUs.
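The learning-rate heuristic could be sketched as follows; the patience parameter and automated check are assumptions, since in the paper the adjustment was made by hand.

```python
def adjust_learning_rate(lr, val_errors, patience=1):
    """Divide the learning rate by 10 when the validation error has stopped
    improving over the last `patience` evaluations."""
    if len(val_errors) > patience and \
            min(val_errors[-patience:]) >= min(val_errors[:-patience]):
        return lr / 10.0
    return lr

# The rate was initialized at 0.01 and reduced three times before termination.
```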
7. RESULTS
Our results on ILSVRC-2010 are summarized in Table 1. Our
network achieves top-1 and top-5 test set error rates of 37.5% and 17.0%, respectively.e The best performance achieved during the ILSVRC-2010 competition was 47.1% and 28.2% with
an approach that averages the predictions produced from six
sparse-coding models trained on different features, and since then the best published results are 45.7% and 25.7%
with an approach that averages the predictions of two classifiers trained on Fisher Vectors (FVs) computed from two types of densely sampled features.
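For reference, the top-1 and top-5 error rates quoted here are the fraction of test images whose true label is not among the model's 1 or 5 highest-scoring classes; a small sketch with illustrative names:

```python
import numpy as np

def top_k_error(scores, labels, k):
    """scores: N x C array of class scores; labels: length-N true class indices."""
    top_k = np.argsort(scores, axis=1)[:, -k:]      # indices of the k best-scoring classes
    hits = (top_k == labels[:, None]).any(axis=1)   # true label among the top k?
    return 1.0 - hits.mean()
```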
We also entered our model in the ILSVRC-2012 competition and report our results in Table 2. Since the ILSVRC-2012
test set labels are not publicly available, we cannot report test error rates for all the models that we tried.
Model           Top-1 (%)    Top-5 (%)
Sparse coding   47.1         28.2
SIFT + FVs29    45.7         25.7
CNN             37.5         17.0

Table 1. Comparison of results on ILSVRC-2010 test set. In italics are best results achieved by others.
e The error rates without averaging predictions over 10 patches as described in Section 5.1 are 39.0% and 18.3%.
Model          Top-1 (val, %)   Top-5 (val, %)   Top-5 (test, %)
SIFT + FVs6    –                –                26.2
1 CNN          40.7             18.2             –
5 CNNs         38.1             16.4             16.4
1 CNN*         39.0             16.6             –
7 CNNs*        36.7             15.4             15.3

Table 2. Comparison of error rates on ILSVRC-2012 validation and test sets. In italics are best results achieved by others. Models with an “*” were “pre-trained” to classify the entire ImageNet 2011 Fall release (see Section 7 for details).