90 COMMUNICATIONS OF THE ACM | JUNE 2017 | VOL. 60 | NO. 6
of poses. We present the results for many more test images in
the supplementary material.
Computing similarity by using Euclidean distance between two 4096-dimensional, real-valued vectors is inefficient, but it could be made efficient by training an autoencoder to compress these vectors to short binary codes. This should produce a much better image retrieval method than applying autoencoders to the raw pixels,16 which does not make use of image labels and hence has a tendency to retrieve images with similar patterns of edges, whether or not they are semantically similar.
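The retrieval scheme sketched above can be illustrated in a few lines of NumPy. This is a minimal sketch, not the method described in the text: the feature vectors are random stand-ins for the network's 4096-dimensional activations, and the binary codes come from a random hyperplane hash rather than a trained autoencoder, which is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the 4096-dimensional activations of a
# gallery of 1000 images and one query image; a real system would
# extract these from the trained network.
gallery = rng.standard_normal((1000, 4096)).astype(np.float32)
query = rng.standard_normal(4096).astype(np.float32)

# Baseline: exact retrieval by Euclidean distance in the 4096-d space.
dists = np.linalg.norm(gallery - query, axis=1)
nearest_euclidean = np.argsort(dists)[:5]

# Cheaper alternative: compare short binary codes by Hamming distance.
# Here the 256-bit codes come from random hyperplanes, a crude stand-in
# for a trained autoencoder's binary bottleneck.
planes = rng.standard_normal((4096, 256)).astype(np.float32)
gallery_codes = gallery @ planes > 0   # (1000, 256) booleans
query_code = query @ planes > 0        # (256,) booleans

hamming = np.count_nonzero(gallery_codes != query_code, axis=1)
nearest_hamming = np.argsort(hamming)[:5]
```

Comparing boolean codes reduces each distance computation from 4096 floating-point operations to a 256-bit XOR and popcount, which is why short binary codes make large-scale retrieval practical.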
8. DISCUSSION
Our results show that a large, deep CNN is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning. It is notable that our network's performance degrades if a single convolutional layer is removed. For example, removing any of the middle layers results in a loss of about 2% for the top-1 performance of the network. So the depth really is important for achieving our results.
To simplify our experiments, we did not use any unsupervised pre-training even though we expect that it will help, especially if we obtain enough computational power to significantly increase the size of the network without obtaining a corresponding increase in the amount of labeled data. Thus far, our results have improved as we have made our network larger and trained it longer, but we still have many orders of magnitude to go in order to match the inferotemporal pathway of the human visual system. Ultimately we would like to use very large and deep convolutional nets on video sequences, where the temporal structure provides very helpful information that is missing or far less obvious in static images.
9. EPILOGUE
The response of the computer vision community to the success of SuperVision was impressive. Over the next year or two, they switched to using deep neural networks, and these are now widely deployed by Google, Facebook, Microsoft, Baidu, and many other companies. By 2015, better hardware, more hidden layers, and a host of technical advances reduced the error rate of deep convolutional neural nets by a further factor of three, so that they are now quite close to human performance for static images.11,31 Much of the credit for this revolution should go to the pioneers who spent many years developing the technology of CNNs, but the essential missing ingredient was supplied by Fei-Fei et al.,7 who put a huge effort into producing a labeled dataset that was finally large enough to show what neural networks could really do.
References
1. Bell, R., Koren, Y. Lessons from the
netflix prize challenge. ACM SIGKDD
Explor. Newsl. 9, 2 (2007), 75–79.
2. Berg, A., Deng, J., Fei-Fei, L. Large
scale visual recognition challenge
2010. www.image-net.org/challenges,
2010.
3. Breiman, L. Random forests. Mach.
Learn. 45, 1 (2001), 5–32.
4. Cireşan, D., Meier, U., Masci, J.,
Gambardella, L., Schmidhuber, J.
High-performance neural networks for
visual object classification. Arxiv
preprint arXiv:1102.0183, 2011.
5. Cireşan, D., Meier, U., Schmidhuber, J.
Multi-column deep neural networks
for image classification. Arxiv preprint
arXiv:1202.2745, 2012.
6. Deng, J., Berg, A., Satheesh, S., Su, H.,
Khosla, A., Fei-Fei, L. In ILSVRC-2012
(2012).
7. Deng, J., Dong, W., Socher, R., Li, L.-J.,
Li, K., Fei-Fei, L. ImageNet: A
large-scale hierarchical image
database. In CVPR09 (2009).
8. Fei-Fei, L., Fergus, R., Perona, P.
Learning generative visual models
from few training examples:
An incremental Bayesian approach
tested on 101 object categories.
Comput. Vision Image Understanding
106, 1 (2007), 59–70.
9. Fukushima, K. Neocognitron: A
self-organizing neural network model
for a mechanism of pattern recognition
unaffected by shift in position. Biol.
Cybern. 36, 4 (1980), 193–202.
10. Griffin, G., Holub, A., Perona, P.
Caltech-256 object category dataset.
Technical Report 7694, California
Institute of Technology, 2007.
11. He, K., Zhang, X., Ren, S., Sun, J. Deep
residual learning for image recognition.
arXiv preprint arXiv:1512.03385, 2015.
12. Hinton, G., Srivastava, N.,
Krizhevsky, A., Sutskever, I.,
Salakhutdinov, R. Improving neural
networks by preventing co-adaptation
of feature detectors. arXiv preprint
arXiv:1207.0580 (2012).
13. Jarrett, K., Kavukcuoglu, K.,
Ranzato, M.A., LeCun, Y. What is the
best multi-stage architecture for
object recognition? In International
Conference on Computer Vision
(2009). IEEE, 2146–2153.
14. Krizhevsky, A. Learning multiple layers
of features from tiny images. Master’s
thesis, Department of Computer
Science, University of Toronto, 2009.
15. Krizhevsky, A. Convolutional deep
belief networks on CIFAR-10.
Unpublished manuscript, 2010.
16. Krizhevsky, A., Hinton, G. Using very
deep autoencoders for content-based
image retrieval. In ESANN (2011).
17. LeCun, Y., Boser, B., Denker, J.,
Henderson, D., Howard, R., Hubbard, W.,
Jackel, L., et al. Handwritten digit
recognition with a back-propagation
network. In Advances in Neural
Information Processing Systems (1990).
18. LeCun, Y. Une procedure
d’apprentissage pour reseau a seuil
asymmetrique (a learning scheme for
asymmetric threshold networks). 1985.
19. LeCun, Y., Huang, F., Bottou, L.
Learning methods for generic object
recognition with invariance to pose and
lighting. In Proceedings of the 2004
IEEE Computer Society Conference on
Computer Vision and Pattern
Recognition, 2004, CVPR 2004.
Volume 2 (2004). IEEE, II–97.
20. LeCun, Y., Kavukcuoglu, K., Farabet, C.
Convolutional networks and
applications in vision. In Proceedings
of 2010 IEEE International
Symposium on Circuits and Systems
(ISCAS) (2010). IEEE, 253–256.
21. Lee, H., Grosse, R., Ranganath, R., Ng,
A. Convolutional deep belief
networks for scalable unsupervised
learning of hierarchical
representations. In Proceedings of
the 26th Annual International
Conference on Machine Learning
(2009). ACM, 609–616.
22. Linnainmaa, S. Taylor expansion of the
accumulated rounding error. BIT
Numer. Math. 16, 2 (1976), 146–160.
23. Mensink, T., Verbeek, J., Perronnin, F.,
Csurka, G. Metric learning for large
scale image classification:
Generalizing to new classes at
near-zero cost. In ECCV – European
Conference on Computer Vision
(Florence, Italy, Oct. 2012).
24. Nair, V., Hinton, G. E. Rectified linear
units improve restricted Boltzmann
machines. In Proceedings of the 27th
International Conference on Machine
Learning (2010).
25. Pinto, N., Cox, D., DiCarlo, J. Why is
real-world visual object recognition
hard? PLoS Comput. Biol. 4, 1 (2008),
e27.
26. Pinto, N., Doukhan, D., DiCarlo, J., Cox,
D. A high-throughput screening
approach to discovering good forms of
biologically inspired visual
representation. PLoS Comput. Biol. 5,
11 (2009), e1000579.
27. Rumelhart, D. E., Hinton, G. E., Williams,
R.J. Learning internal representations
by error propagation. Technical report,
DTIC Document, 1985.
28. Russell, B.C., Torralba, A., Murphy, K.,
Freeman, W. Labelme: A database and
web-based tool for image annotation.
Int. J. Comput Vis. 77, 1 (2008),
157–173.
29. Sánchez, J., Perronnin, F. High-dimensional signature compression for
large-scale image classification. In
IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2011
(2011). IEEE, 1665–1672.
30. Simard, P., Steinkraus, D., Platt, J. Best
practices for convolutional neural
networks applied to visual document
analysis. In Proceedings of the
Seventh International Conference on
Document Analysis and Recognition.
Volume 2 (2003), 958–962.
31. Szegedy, C., Liu, W., Jia, Y., Sermanet,
P., Reed, S., Anguelov, D., Erhan, D.,
Vanhoucke, V., Rabinovich, A.
Going deeper with convolutions.
In Proceedings of the IEEE Conference
on Computer Vision and Pattern
Recognition (2015), 1–9.
32. Turaga, S., Murray, J., Jain, V., Roth, F.,
Helmstaedter, M., Briggman, K., Denk,
W., Seung, H. Convolutional networks
can learn to generate affinity graphs
for image segmentation. Neural
Comput. 22, 2 (2010), 511–538.
33. Werbos, P. Beyond regression: New
tools for prediction and analysis in
the behavioral sciences, 1974.
Alex Krizhevsky and Geoffrey E. Hinton
({akrizhevsky, geoffhinton}@google.com),
Google Inc.
Ilya Sutskever (ilyasu@openai.com),
OpenAI.
Copyright held by Authors/Owners.