If we set s < z, we obtain overlapping pooling. This is what we use throughout our network, with s = 2 and z = 3. This scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme s = 2, z = 2, which produces output of equivalent dimensions. We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit.
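As a concrete illustration (a minimal PyTorch sketch of our devising, not the paper's original GPU code; the 55 × 55 input size matches the first-layer maps described below), both schemes map 55 × 55 maps to 27 × 27:

import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)                            # e.g., first-layer feature maps
overlapping = nn.MaxPool2d(kernel_size=3, stride=2)       # z = 3, s = 2
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)   # z = 2, s = 2
print(overlapping(x).shape)      # torch.Size([1, 96, 27, 27])
print(non_overlapping(x).shape)  # torch.Size([1, 96, 27, 27]) -- same dimensions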
4.5. Overall architecture
Now we are ready to describe the overall architecture of
our CNN. As depicted in Figure 2, the net contains eight
layers with weights; the first five are convolutional and the
remaining three are fully connected. The output of the last
fully connected layer is fed to a 1000-way softmax which
produces a distribution over the 1000 class labels. Our network maximizes the multinomial logistic regression
objective, which is equivalent to maximizing the average
across training cases of the log-probability of the correct
label under the prediction distribution.
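Stated compactly (our restatement; the symbols N, x_n, y_n, z_k, and theta are our notation, not the paper's), the objective is

\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \log p(y_n \mid x_n; \theta),
\qquad
p(k \mid x; \theta) = \frac{\exp z_k(x; \theta)}{\sum_{j=1}^{1000} \exp z_j(x; \theta)},

where the z_k are the inputs to the final softmax layer over N training pairs (x_n, y_n), and theta collects the network's weights.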
The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU (see Figure 2). The kernels of the third convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully connected layers are connected to all neurons in the previous layer. Response-normalization layers follow the first and second convolutional layers. Max-pooling layers, of the kind described in Section 4.4, follow both response-normalization layers as well as the fifth convolutional layer. The ReLU nonlinearity is applied to the output of every convolutional and fully connected layer.
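This restricted connectivity can be reproduced on a single device with grouped convolutions; the sketch below (PyTorch, our single-GPU approximation rather than the paper's two-GPU implementation) shows the second layer's 256 kernels each seeing only 48 of the 96 first-layer maps:

import torch
import torch.nn as nn

# Second convolutional layer with the two-GPU connectivity pattern:
# groups=2 splits the 96 input maps into two halves of 48, so each of the
# 256 kernels has size 5 x 5 x 48, as in the text.
conv2 = nn.Conv2d(in_channels=96, out_channels=256, kernel_size=5,
                  padding=2, groups=2)
x = torch.randn(1, 96, 27, 27)    # pooled first-layer output
print(conv2(x).shape)             # torch.Size([1, 256, 27, 27])
print(conv2.weight.shape)         # torch.Size([256, 48, 5, 5])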
The first convolutional layer filters the 224 × 224 × 3 input image with 96 kernels of size 11 × 11 × 3 with a stride of 4 pixels (this is the distance between the receptive field centers of neighboring neurons in a kernel map). The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3 × 3 × 256 connected to the (normalized, pooled) outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3 × 3 × 192, and the fifth convolutional layer has 256 kernels of size 3 × 3 × 192. The fully connected layers have 4096 neurons each.
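The per-layer neuron counts quoted in the caption of Figure 2 can be checked with the standard output-size formula; the sketch below is our own arithmetic, and the padding values are assumptions (the paper does not state them) chosen so that a 224 × 224 input yields the stated 55 × 55 first-layer maps:

def out_size(n, k, s=1, p=0):
    # Spatial size after a convolution or pooling layer.
    return (n + 2 * p - k) // s + 1

n = out_size(224, k=11, s=4, p=2)   # conv1: 55 -> 55*55*96  = 290,400 neurons
n = out_size(n, k=3, s=2)           # pool1: 27
n = out_size(n, k=5, p=2)           # conv2: 27 -> 27*27*256 = 186,624
n = out_size(n, k=3, s=2)           # pool2: 13
n = out_size(n, k=3, p=1)           # conv3: 13 -> 13*13*384 = 64,896
n = out_size(n, k=3, p=1)           # conv4: 13 -> 13*13*384 = 64,896
n = out_size(n, k=3, p=1)           # conv5: 13 -> 13*13*256 = 43,264
# followed by fully connected layers of 4096, 4096, and 1000 neurons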
Figure 2. An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network's input is 150,528-dimensional, and the number of neurons in the network's remaining layers is given by 290,400–186,624–64,896–64,896–43,264–4096–4096–1000.

5. REDUCING OVERFITTING
Our neural network architecture has 60 million parameters. Although the 1000 classes of ILSVRC make each training example impose 10 bits of constraint on the mapping from image to label (since log₂ 1000 ≈ 10), this turns out to be insufficient to learn so many parameters without considerable overfitting. Below, we describe the two primary ways in which we combat overfitting.

5.1. Data augmentation
The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations (e.g., Refs. 4, 5, 30). We employ two distinct forms of data augmentation, both of which allow transformed images to be produced from the original images with very little computation, so the transformed images do not need to be stored on disk. In our implementation, the transformed images are generated in Python code on the CPU while the GPU is training on the previous batch of images. So these data augmentation schemes are, in effect, computationally free.

The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images and training our network on these extracted patches.d This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks. At test time, the network makes a prediction by extracting five 224 × 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network's softmax layer on the ten patches.

d This is the reason why the input images in Figure 2 are 224 × 224 × 3 dimensional.
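A minimal NumPy sketch of both the training-time and test-time schemes (our illustration; predict stands in for a hypothetical function returning the network's softmax output for one 224 × 224 patch):

import numpy as np

def random_crop_flip(img, size=224):
    # Training-time augmentation: a random size x size patch of a
    # 256 x 256 x 3 image, horizontally reflected half the time.
    h, w = img.shape[:2]
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    patch = img[top:top + size, left:left + size]
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]
    return patch

def ten_patch_predict(img, predict, size=224):
    # Test-time scheme: average the softmax outputs over the four corner
    # patches, the center patch, and their horizontal reflections.
    h, w = img.shape[:2]
    offsets = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
               ((h - size) // 2, (w - size) // 2)]
    outputs = []
    for top, left in offsets:
        patch = img[top:top + size, left:left + size]
        outputs.append(predict(patch))
        outputs.append(predict(patch[:, ::-1]))
    return np.mean(outputs, axis=0)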