nity of photographers that differs from
the overall Flickr user base.
Licenses. The licenses themselves
vary by CC type, with approximately
31.8% of the dataset marked appropriate
for commercial use and 17.3%
assigned the most liberal license, requiring
attribution for only the photographer
who took the photo (see
Content. The YFCC100M dataset
includes a diverse collection of complex real-world scenes, ranging from
200,000 street-life-blogged photos
by photographer Andy Nystrom (see
Figure 3a) to snapshots of daily life,
holidays, and events (see Figure 3b).
To understand more about the visual
content represented in the dataset,
we used a deep-learning approach to
detect a variety of concepts (such as
Table 1. Popular multimedia datasets used by the research community. When various versions of a particular collection are available,
we generally include only the most recent one. PASCAL, TRECVID, MediaEval, and ImageCLEF are recurring annual benchmarks that
consist of one or more challenges, each with its own dataset; here, we report the total number of media objects aggregated over all
datasets that are part of the most recent edition of each benchmark.
Year  Dataset         Type       Image  Video  Audio  License  Accessibility  Content
1966  Brodatz         texture    <1K    -      -      ©
1996  COIL-100        object     7K     -      -      b
1996  Corel           stock      60K    -      -      ©
2000  FERET           face       14K    -      -      ©
2005  Yale Face B+    face       16K    -      -      ©
2005  Ponce           texture    1K     -      -      ©
2007  Caltech-256     object     30K    -      -      b
2007  Oxford          buildings  5K     -      -      ©
2008  CMU Multi-PIE   face       750K   -      -      ©
2008  Tiny Images     web        80M    -      -      ©
2008  MIRFLICKR-25K   Flickr     25K    -      -      c
2009  NUS-WIDE        Flickr     270K   -      -      ©
2009  ImageNet        web        14M    -      -      ©
2010  SUN             web        131K   -      -      ©
2010  MIRFLICKR-1M    Flickr     1M     -      -      c
2012  PASCAL          Flickr     23K    -      -      ©        *
2013  MS Clickture    web        40M    -      -      ©        **
2014  Sports-1M       sports     -      1M     -      c        ***
2014  MS COCO         Flickr     330K   -      -      c
2014  YFCC100M        Flickr     99M    800K   -      c        ****
2015  TRECVID         mixed      -      220K   -      ©
2015  MediaEval       mixed      6M     51K    1.4K   ©        *****
2015  ImageCLEF       mixed      500K   -      -      ©
The icons represent the following:
© Some or all content in dataset is copyrighted.
c All content in dataset has a Creative Commons license.
b Content in dataset can be freely used on condition of citing the dataset paper.
Dataset can only be obtained by accepting a license agreement.
Dataset can only be obtained after creating
Dataset can only be obtained by participating in a benchmark competition.
Dataset has to be downloaded.
Dataset has to be purchased.
Dataset is delivered by mail.
Dataset contains URLs to the content instead of the content itself.
Dataset contains generated features.
Dataset contains locations.
Dataset contains subtitles, transcripts, or captions describing the content.
Dataset contains tags.
Dataset contains search engine click log data.
Dataset contains object bounding boxes.
Dataset contains object segmentations.
Dataset contains user information.
Dataset contains camera information.
Dataset contains timestamps.
Dataset contains content found by querying search engines with dictionary words.
Dataset is fully/partially annotated with class labels.
Dataset is still evolving.
Dataset changes from year to year.
* The PASCAL training and development data can be freely downloaded, but the test data requires registration.
** Reduced-resolution images are included in the dataset, while full-resolution images must be downloaded separately.
*** The Sports-1M dataset has a CC license, though the videos it links to are hosted on YouTube and copyrighted.
**** The photos and videos have been uploaded to the cloud; like the metadata, the photo and video data can be mounted
as a read-only network drive or downloaded.
***** Most MediaEval challenges include the media objects in their dataset, though some provide only URLs; in previous editions
data had to be purchased and delivered by postal mail in order to participate in certain challenges.
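
The counts in Table 1 are abbreviated ("<1K", "800K", "99M"), which makes them awkward to compare directly. The following is a minimal Python sketch, not part of the original article, that expands those abbreviations into integers and finds the largest collection among a few rows copied from the table. The names `parse_count`, `ROWS`, and `total_media` are hypothetical, and treating "<1K" as its upper bound of 1,000 is an assumption made only for illustration.

```python
# Minimal sketch: expand the abbreviated counts used in Table 1.
# All names here are illustrative assumptions, not from the article.

def parse_count(text):
    """Convert a Table 1 count such as '99M' or '800K' to an integer.

    '-' (no media of that type) maps to 0. A leading '<' is dropped,
    so '<1K' is treated as its upper bound of 1000 (an assumption).
    """
    text = text.strip()
    if text == "-":
        return 0
    text = text.lstrip("<")
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = text[-1].upper()
    if suffix in multipliers:
        return int(round(float(text[:-1]) * multipliers[suffix]))
    return int(text)


# (dataset, images, videos) copied from a few rows of Table 1.
ROWS = [
    ("Tiny Images", "80M", "-"),
    ("ImageNet", "14M", "-"),
    ("Sports-1M", "-", "1M"),
    ("YFCC100M", "99M", "800K"),
]


def total_media(row):
    """Sum the image and video counts of one table row."""
    _, images, videos = row
    return parse_count(images) + parse_count(videos)


largest = max(ROWS, key=total_media)
print(largest[0], total_media(largest))
```

Because the table values are rounded, totals computed this way are approximate and are suitable only for order-of-magnitude comparisons between datasets.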