tostreams of individuals represent
multiple facets of recorded visual information, from remembering moments and storytelling to social communication and self-identity.19 How
to preserve digital culture is a grand
challenge of sensemaking and understanding digital archives from nonhomogeneous sources. Photographers
and curators alike have contributed
to the larger collection of Creative
Commons images, yet little is known
of how such archives will be navigated
and retrieved or how new information can be discovered therein. The
YFCC100M dataset offers avenues of
computer science research in multimedia, information retrieval, and
data visualization, in addition to the
larger questions of how to preserve
Data is a core component of research
and development in all scientific fields.
In the field of multimedia, datasets are
usually created for a single purpose
and, as such, lack reusability. Moreover, datasets generally are not, or may
not be, freely shared with others and, as
such, also lack reproducibility, transparency, and accountability. That is
why we released one of the largest datasets ever created, with 100 million
media objects, published under a Creative Commons license. We curated the
YFCC100M dataset to be comprehensive and representative of real-world
photography, expansive and expandable in coverage, free and legal to use,
and to intentionally consolidate and supplant many existing collections. The
YFCC100M dataset encourages improvement and validation of research
methods, reduces the effort to acquire
data, and stimulates innovation and
potential new data uses. We have further provided rules on how the dataset
should be used to comply with licensing, attribution, and copyright and offered guidelines on how to maximize
compatibility and promote reproducibility of experiments with existing and
We thank Jordan Gimbel and Kim
Capps-Tanaka at Yahoo, Pierre Garrigues, Simon Osindero, and the rest
of the Flickr Vision & Search team,
Carmen Carrano and Roger Pearce at
Lawrence Livermore National Laboratory, and Julia Bernd, Jaeyoung Choi,
Luke Gottlieb, and Adam Janin at the
International Computer Science Institute (ICSI). We are further thankful to ICSI for making its data publicly available in collaboration with
Amazon. Portions of this work were
performed under the auspices of the
U.S. Department of Energy by Lawrence Livermore National Laboratory
under Contract DE-AC52-07NA27344
and supported by the National Science Foundation by ICSI under Award
1. Bernd, J., Borth, D., Elizalde, B., Friedland, G.,
Gallagher, H., Gottlieb, L.R., Janin, A., Karabashlieva,
S., Takahashi, J., and Won, J. The YLI-MED corpus:
Characteristics, procedures, and plans. Computing
Research Repository Division of arXiv abs/1503.04250
2. Borgman, C.L. The conundrum of sharing research
data. Journal of the American Society for Information
Science and Technology 63, 6 (Apr. 2012), 1059–1078.
3. Choi, J., Thomee, B., Friedland, G., Cao, L., Ni, K.,
Borth, D., Elizalde, B., Gottlieb, L., Carrano, C., Pearce,
R., and Poland, D. The placing task: A large-scale
geo-estimation challenge for social-media videos
and images. In Proceedings of the Third ACM
International Workshop on Geotagging and Its
Applications in Multimedia (Orlando, FL, Nov. 3–7).
ACM Press, New York, 2014, 27–31.
4. Crandall, D.J., Backstrom, L., Huttenlocher, D.,
and Kleinberg, J. Mapping the world’s photos.
In Proceedings of the 18th IW3C2 International
Conference on the World Wide Web (Madrid, Spain,
Apr. 20–24). ACM Press, New York, 2009, 761–770.
5. Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image
database. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (Miami, FL,
June 20–25). IEEE Press, New York, 2009, 248–255.
6. Facebook, Ericsson, and Qualcomm. A Focus on
Efficiency. Technical Report, Internet.org, 2013;
7. Fienberg, S.E., Martin, M.E., and Straf, M.L. Eds.
(National Research Council). Sharing Research Data.
National Academy Press, Washington, D.C., 1985; http://
8. Good, J. How many photos have ever been taken?
Internet Archive Wayback Machine, Sept. 2011;
9. Hays, J. and Efros, A.A. IM2GPS: Estimating
geographic information from a single image. In
Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (Anchorage, AK, June
23–28). IEEE Press, New York, 2008.
10. Hecht, B., Hong, L., Suh, B., and Chi, E.H. Tweets from
Justin Bieber’s heart: The dynamics of the location
field in user profiles. In Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems
(Vancouver, Canada, May 7–12). ACM Press, New York,
11. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long,
J., Girshick, R.B., Guadarrama, S., and Darrell, T. Caffe:
Convolutional architecture for fast feature embedding.
In Proceedings of the 22nd ACM International
Conference on Multimedia (Orlando, FL, Nov. 3–7).
ACM Press, New York, 2014, 675–678.
12. Kremerskothen, K. Welcome the Internet archive to
the commons. Flickr, San Francisco, CA, Aug. 2014;
13. Krizhevsky, A., Sutskever, I., and Hinton, G.E.
ImageNet classification with deep convolutional
neural networks. In Proceedings of Advances in
Neural Information Processing Systems (Lake Tahoe,
NV, Dec. 3–8). Curran Associates, Red Hook, NY, 2012,
14. Li, L., Socher, R., and Fei-Fei, L. Towards total
scene understanding: Classification, annotation
and segmentation in an automatic framework. In
Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (Miami, FL, June
20–25). IEEE Press, New York, 2009, 2036–2043.
15. Rattenbury, T., Good, N., and Naaman, M. Towards
automatic extraction of event and place semantics
from Flickr tags. In Proceedings of the 30th
ACM International Conference on Research and
Development in Information Retrieval (Amsterdam,
the Netherlands, July 23–27). ACM Press, New York,
16. Renear, A.H., Sacchi, S., and Wickett, K.M. Definitions
of dataset in the scientific and technical literature.
In Proceedings of the 73rd Annual Meeting of the
American Society for Information Science and
Technology (Pittsburgh, PA, Oct. 22–27). Association
for Information Science and Technology, Silver Spring,
MD, 2010, article 81.
17. Snavely, N., Seitz, S., and Szeliski, R. Photo tourism:
Exploring photo collections in 3D. ACM Transactions
on Graphics 25, 3 (July 2006), 835–846.
18. Swan, A. and Brown, S. To Share or Not to Share:
Publication and Quality Assurance of Research Data
Outputs. Technical Report. Research Information
Network, London, U.K., 2008.
19. Van Dijck, J. Digital photography: Communication,
identity, memory. Visual Communication 7, 1 (Feb.
20. Wilson, M.L., Chi, E.H., Reeves, S., and Coyle, D. RepliCHI:
The workshop II. In Proceedings of the International
Conference on Human Factors in Computing Systems,
Extended Abstracts (Toronto, Canada, Apr. 26–May 1).
ACM Press, New York, 2014, 33–36.
21. Yelp. Yelp Dataset Challenge. Yelp, San Francisco, CA;
22. YouTube. YouTube press statistics. YouTube, San
Bruno, CA; http://youtube.com/yt/press/statistics.html
Bart Thomee (email@example.com) is a senior
research scientist in the HCI Research Group at Yahoo
Labs and Flickr in San Francisco, CA.
David A. Shamma (firstname.lastname@example.org) is director of the
HCI Research Group at Yahoo Labs and Flickr in San
Gerald Friedland (email@example.com) is director
of the Audio and Multimedia Lab at the International
Computer Science Institute in Berkeley, CA.
Benjamin Elizalde (firstname.lastname@example.org) is a
Ph.D. student at Carnegie Mellon University in Mountain
View, CA; this work was done while he was at the
International Computer Science Institute in Berkeley, CA.
Karl Ni (email@example.com) is a program lead and senior data
scientist at In-Q-Tel’s Lab41 in Menlo Park, CA; this work
was done while he was at Lawrence Livermore National
Laboratory in Livermore, CA.
Douglas Poland (firstname.lastname@example.org) is a principal
investigator at the Lawrence Livermore National
Laboratory in Livermore, CA.
Damian Borth (email@example.com) is head of
the Multimedia Analysis & Data Mining Group at the
German Research Center for Artificial Intelligence in
Kaiserslautern, Germany; this work was done while he
was at the International Computer Science Institute in
Li-Jia Li (firstname.lastname@example.org) is head of research at
Snapchat, Venice, CA; this work was done while she was
at Yahoo Labs, San Francisco, CA.
Copyright held by authors.
Publication rights licensed to ACM $15.00.