Figure 1. A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line). The learning rates for each network were chosen independently to make training as fast as possible. No regularization of any kind was employed. The magnitude of the effect demonstrated here varies with network architecture, but networks with ReLUs consistently learn several times faster than equivalents with saturating neurons. (Plot: training error rate vs. epochs.)
For example, Jarrett et al.13 claim that the nonlinearity f(x) = |tanh(x)| works particularly well with their type of contrast normalization followed by local average pooling on the Caltech-101 dataset. However, on this
dataset the primary concern is preventing overfitting, so the
effect they are observing is different from the accelerated
ability to fit the training set which we report when using
ReLUs. Faster learning has a great influence on the performance of large models trained on large datasets.
4.2. Training on multiple GPUs
A single GTX 580 GPU has only 3GB of memory, which limits
the maximum size of the networks that can be trained on it.
It turns out that 1.2 million training examples are enough
to train networks which are too big to fit on one GPU.
Therefore we spread the net across two GPUs. Current GPUs
are particularly well-suited to cross-GPU parallelization, as
they are able to read from and write to one another’s memory directly, without going through host machine memory.
The parallelization scheme that we employ essentially puts
half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers.
This means that, for example, the kernels of layer 3 take
input from all kernel maps in layer 2. However, kernels in
layer 4 take input only from those kernel maps in layer 3
which reside on the same GPU. Choosing the pattern of connectivity is a problem for cross-validation, but this allows us
to precisely tune the amount of communication until it is
an acceptable fraction of the amount of computation.
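To make this connectivity pattern concrete, the following minimal NumPy sketch (our own illustration, not the released cuda-convnet code) contrasts a layer whose kernels read all kernel maps from both GPUs with a layer whose kernels read only the maps resident on their own GPU. Spatial filtering is reduced to a 1 × 1 channel mixing so that only the grouping is visible, and all function and variable names are illustrative.

```python
import numpy as np

def cross_gpu_layer(x, w):
    """Layer with cross-GPU communication: every output map mixes all
    C_in input maps. x: (C_in, H, W); w: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def same_gpu_layer(x, w_gpu0, w_gpu1):
    """GPU-restricted layer: the first half of the output maps sees only
    the input maps on GPU 0, the second half only those on GPU 1, so no
    cross-GPU traffic is needed. w_gpu0, w_gpu1: (C_out/2, C_in/2)."""
    c = x.shape[0] // 2
    y0 = np.einsum('oc,chw->ohw', w_gpu0, x[:c])  # stays on GPU 0
    y1 = np.einsum('oc,chw->ohw', w_gpu1, x[c:])  # stays on GPU 1
    return np.concatenate([y0, y1], axis=0)
```

In the scheme described above, layer 3 behaves like cross_gpu_layer (incurring cross-GPU communication) while layer 4 behaves like same_gpu_layer (communication-free).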
The resultant architecture is somewhat similar to that of
the “columnar” CNN employed by Cireşan et al.,4 except that
our columns are not independent (see Figure 2). This scheme
reduces our top-1 and top-5 error rates by 1.7% and 1.2%,
respectively, as compared with a net with half as many kernels
in each convolutional layer trained on one GPU. The two-GPU
net takes slightly less time to train than the one-GPU net.b
4.3. Local response normalization
ReLUs have the desirable property that they do not require
input normalization to prevent them from saturating. If at
least some training examples produce a positive input to a
ReLU, learning will happen in that neuron. However, we still
find that the following local normalization scheme aids generalization. Denoting by $a^i_{x,y}$ the activity of a neuron computed by applying kernel $i$ at position $(x, y)$ and then applying the ReLU nonlinearity, the response-normalized activity $b^i_{x,y}$ is given by the expression

$$ b^i_{x,y} = a^i_{x,y} \bigg/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^j_{x,y} \right)^2 \right)^{\beta} $$
where the sum runs over n “adjacent” kernel maps at the
same spatial position, and N is the total number of kernels in
the layer. The ordering of the kernel maps is of course arbitrary and determined before training begins. This sort of
response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities among neuron outputs computed
using different kernels. The constants k, n, α, and β are hyper-parameters whose values are determined using a validation
set; we used k = 2, n = 5, α = 10⁻⁴, and β = 0.75. We applied this
normalization after applying the ReLU nonlinearity in certain layers (see Section 4.5).
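As a concrete illustration, here is a minimal NumPy sketch of this response normalization applied to a stack of kernel maps; the array layout (kernel maps along the leading axis) and the function name are our own choices, not taken from the released cuda-convnet code.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Brightness-style response normalization.
    a: post-ReLU activities of shape (N, H, W), N = number of kernel maps.
    Each map i is divided by (k + alpha * sum of squared activities over
    the n adjacent maps centered on i) ** beta at every spatial position."""
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        b[i] = a[i] / (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
    return b
```

The defaults mirror the hyper-parameter values quoted above; calling local_response_norm(a) on an (N, H, W) array of post-ReLU activities returns the normalized activities.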
This scheme bears some resemblance to the local contrast
normalization scheme of Jarrett et al.,13 but ours would be
more correctly termed “brightness normalization,” since we
do not subtract the mean activity. Response normalization
reduces our top-1 and top-5 error rates by 1.4% and 1.2%,
respectively. We also verified the effectiveness of this scheme
on the CIFAR-10 dataset: a four-layer CNN achieved a 13% test
error rate without normalization and 11% with
normalization.c
4.4. Overlapping pooling
Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map. Traditionally,
the neighborhoods summarized by adjacent pooling units do
not overlap (e.g., Refs.5, 13, 20). To be more precise, a pooling
layer can be thought of as consisting of a grid of pooling units
spaced s pixels apart, each summarizing a neighborhood of
size z × z centered at the location of the pooling unit. If we set
s = z, we obtain traditional local pooling as commonly
employed in CNNs. If we set s < z, we obtain overlapping pooling.
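As a sketch, the following NumPy function (our illustration, not the paper's implementation; max pooling is used for concreteness) pools a single kernel map with pooling units spaced s pixels apart, each summarizing a z × z neighborhood. Setting s = z gives traditional non-overlapping pooling, while s < z gives the overlapping variant.

```python
import numpy as np

def pool_2d(a, z, s):
    """Pool a single kernel map a of shape (H, W): pooling units are
    spaced s pixels apart, each taking the max over a z x z neighborhood."""
    H, W = a.shape
    out_h = (H - z) // s + 1
    out_w = (W - z) // s + 1
    out = np.empty((out_h, out_w), dtype=a.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = a[i * s:i * s + z, j * s:j * s + z].max()
    return out
```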
b The one-GPU net actually has the same number of kernels as the two-GPU
net in the final convolutional layer. This is because most of the net’s param-
eters are in the first fully connected layer, which takes the last convolutional
layer as input. So to make the two nets have approximately the same num-
ber of parameters, we did not halve the size of the final convolutional layer
(nor the fully connected layers which follow). Therefore this comparison is
biased in favor of the one-GPU net, since it is bigger than “half the size” of
the two-GPU net.
c We cannot describe this network in detail due to space constraints, but it
is specified precisely by the code and parameter files provided here: http://
code.google.com/p/cuda-convnet/.