JUNE 2017 | VOL. 60 | NO. 6 | COMMUNICATIONS OF THE ACM 89
Finally, we also report our error rates on the Fall 2009 version of ImageNet with 10,184 categories and 8.9 million images. On this dataset we follow the convention in the literature of using half of the images for training and half for testing. Since there is no established test set, our split necessarily differs from the splits used by previous authors, but this does not affect the results appreciably. Our top-1 and top-5 error rates on this dataset are 67.4% and 40.9%, attained by the net described above but with an additional, sixth convolutional layer over the last pooling layer. The best published results on this dataset are 78.1% and 60.9%.

7.1. Qualitative evaluations
Figure 3 shows the convolutional kernels learned by the network's two data-connected layers. The network has learned a variety of frequency- and orientation-selective kernels, as well as various colored blobs. Notice the specialization exhibited by the two GPUs, a result of the restricted connectivity described in Section 4.5. The kernels on GPU 1 are largely color-agnostic, while the kernels on GPU 2 are largely color-specific. This kind of specialization occurs during every run and is independent of any particular random weight initialization (modulo a renumbering of the GPUs).

In the left panel of Figure 4 we qualitatively assess what the network has learned by computing its top-5 predictions on eight test images. Notice that even off-center objects, such as the mite in the top-left, can be recognized by the net. Most of the top-5 labels appear reasonable. For example, only other types of cat are considered plausible labels for the leopard. In some cases (grille, cherry) there is genuine ambiguity about the intended focus of the photograph.

Another way to probe the network's visual knowledge is to consider the feature activations induced by an image at the last, 4096-dimensional hidden layer. If two images produce feature activation vectors with a small Euclidean separation,
Figure 3. Ninety-six convolutional kernels of size 11 × 11 × 3 learned by the first convolutional layer on the 224 × 224 × 3 input images. The top 48 kernels were learned on GPU 1 while the bottom 48 kernels were learned on GPU 2 (see Section 7.1 for details).
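A kernel grid like the one in Figure 3 can be assembled in a few lines of NumPy. The sketch below is illustrative, not from the paper: it assumes the learned first-layer weights are already available as an array of shape (96, 11, 11, 3) with values scaled to [0, 1] for display, and tiles them into an 8 × 12 grid with one pixel of white padding between filters.

```python
import numpy as np

def tile_kernels(kernels, rows, cols, pad=1):
    """Arrange (N, H, W, C) kernels into a (rows*H', cols*W', C) image grid.

    Each kernel is framed with `pad` pixels of white so the individual
    11 x 11 filters remain visually separable, as in Figure 3.
    """
    n, h, w, c = kernels.shape
    assert n == rows * cols, "grid must hold exactly all kernels"
    grid = np.ones((rows * (h + pad) + pad, cols * (w + pad) + pad, c))
    for i in range(n):
        r, col = divmod(i, cols)
        y, x = pad + r * (h + pad), pad + col * (w + pad)
        grid[y:y + h, x:x + w] = kernels[i]
    return grid

# Random stand-in weights: 96 kernels of size 11 x 11 x 3
# (two GPU halves of 48 kernels each, stacked top and bottom).
kernels = np.random.rand(96, 11, 11, 3)
grid = tile_kernels(kernels, rows=8, cols=12)
```

The resulting `grid` array can then be passed to any image-display routine. Displaying the first 48 kernels in the top rows and the last 48 in the bottom rows reproduces the GPU-1/GPU-2 split described in the caption.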
Figure 4. (Left) Eight ILSVRC-2010 test images and the five labels considered most probable by our model. The correct label is written
under each image, and the probability assigned to the correct label is also shown with a red bar (if it happens to be in the top 5). (Right) Five
ILSVRC-2010 test images in the first column. The remaining columns show the six training images that produce feature vectors in the last
hidden layer with the smallest Euclidean distance from the feature vector for the test image.
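The top-5 predictions shown in the left panel of Figure 4, and the top-1/top-5 error rates quoted in the text, follow directly from the network's output probabilities: take the five highest-probability classes, and count a test image as correct (for top-k) if its true label is among them. A minimal NumPy sketch, with made-up probability vectors standing in for the softmax outputs:

```python
import numpy as np

def top_k(probs, k=5):
    """Indices of the k highest-probability classes, most probable first."""
    return np.argsort(probs)[::-1][:k]

def error_rates(probs, labels, k=5):
    """Top-1 and top-k error over a batch of (N, num_classes) probabilities."""
    order = np.argsort(probs, axis=1)[:, ::-1]   # classes sorted by descending probability
    top1 = np.mean(order[:, 0] != labels)
    topk = np.mean([labels[i] not in order[i, :k] for i in range(len(labels))])
    return top1, topk

# Toy example: 3 "images", 6 classes (distinct probabilities to avoid ties).
probs = np.array([
    [0.04, 0.60, 0.12, 0.11, 0.08, 0.05],   # true class 1: top-1 hit
    [0.30, 0.25, 0.20, 0.15, 0.06, 0.04],   # true class 4: in top 5 only
    [0.40, 0.30, 0.15, 0.08, 0.04, 0.03],   # true class 5: outside top 5
])
labels = np.array([1, 4, 5])
top1_err, top5_err = error_rates(probs, labels)
```

With these toy inputs the top-1 error is 2/3 and the top-5 error is 1/3, mirroring how the two headline numbers in the text are computed over the full test set.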
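The retrieval shown in the right panel of Figure 4 amounts to a nearest-neighbor search under Euclidean distance in the 4096-dimensional space of last-hidden-layer activations. A sketch under the assumption that those activations have already been extracted into NumPy arrays (the array names and sizes here are illustrative):

```python
import numpy as np

def nearest_training_images(test_feat, train_feats, k=6):
    """Indices of the k training images whose last-hidden-layer feature
    vectors lie closest (in Euclidean distance) to the test image's vector."""
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
# Stand-in activations: 1000 training images, 4096 features each.
train_feats = rng.standard_normal((1000, 4096))
# A test feature vector deliberately placed near training image 42.
test_feat = train_feats[42] + 0.01 * rng.standard_normal(4096)
neighbors = nearest_training_images(test_feat, train_feats, k=6)
```

For a training set of millions of images a brute-force scan like this is expensive; the paper itself notes (in the surrounding discussion) that such retrieval could be made efficient, e.g. by compressing the feature vectors to short codes, but the brute-force version above captures the definition used in the figure.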