Figure 6: Typical failure cases. Some results exhibit pronounced texture seams (top). Others are failures of scene matching (middle). The last failure mode (bottom), shared with traditional image completion algorithms, is a failure to adhere to high-level semantics (e.g., entire people). [Panels: input, output, scene match.]
Figure 7: Situations where existing image completion algorithms perform better than our algorithm. [Panels: input, Criminisi et al., our algorithm.]
One can imagine such completion offered as a web service: a user would submit an incomplete photo, and a remote service would search a massive database, in parallel, and return results.
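As a toy illustration of how such a service might split that search across workers, here is a minimal sketch; the sharding scheme, descriptor dimensionality, and every name below are assumptions for illustration, not part of the system described in this paper:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def best_in_shard(args):
    """Return (distance, global index) of the best match in one shard."""
    query, shard, offset = args
    dists = np.linalg.norm(shard - query, axis=1)  # L2 in descriptor space
    i = int(np.argmin(dists))
    return float(dists[i]), offset + i

def parallel_scene_search(query, shards, workers=8):
    """Search descriptor shards in parallel; return the global best index."""
    jobs, offset = [], 0
    for shard in shards:
        jobs.append((query, shard, offset))
        offset += len(shard)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Tuples compare lexicographically, so min() picks the smallest
        # distance across all shards and carries its global index along.
        return min(pool.map(best_in_shard, jobs))[1]
```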
7. TOWARD BRUTE-FORCE IMAGE UNDERSTANDING
Beyond the particular graphics application, the deeper question for all appearance-based data-driven methods is this:
would it be possible to ever have enough data to represent
the entire visual world? Clearly, attempting to gather all possible images of our dynamic world is a futile task, but what
about collecting the set of all semantically differentiable
scenes? That is, given any input image can we find a scene
that is “similar enough” under some metric? The truly exciting (and surprising!) result of our work is that not only does
it seem possible, but the number of required images might
not be astronomically large.

Figure 8: The percentage of images marked as fake within a given maximum viewing time by participants in our perceptual study. [Plot: fraction of images marked fake (y-axis, 0 to 1) versus maximum response time in seconds (x-axis, 0 to 90); curves for Criminisi et al., our algorithm, and real photographs.]

This paper, along with the work of Torralba et al.,21 suggests the feasibility of sampling from the
entire space of scenes as a way of exhaustively modeling our
visual world. This, in turn, might allow us to “brute force”
many currently unsolvable vision and graphics problems!
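Concretely, the question of finding a scene that is "similar enough" under some metric reduces to comparing global scene descriptors. The following is a crude, GIST-flavored sketch of such a descriptor, purely illustrative and not the paper's exact feature: it averages gradient energy over a coarse spatial grid, so two images with similar spatial layout land near each other under L2 distance.

```python
import numpy as np
from scipy import ndimage

def toy_scene_descriptor(gray, grid=4):
    """A crude global scene descriptor (illustrative only).

    gray: 2-D float array (grayscale image).
    Returns a unit-norm vector of per-cell horizontal/vertical
    gradient energies over a grid x grid spatial layout.
    """
    gx = ndimage.sobel(gray, axis=1)  # horizontal edge responses
    gy = ndimage.sobel(gray, axis=0)  # vertical edge responses
    h, w = gray.shape
    cells = []
    for i in range(grid):
        for j in range(grid):
            ys = slice(i * h // grid, (i + 1) * h // grid)
            xs = slice(j * w // grid, (j + 1) * w // grid)
            cells += [np.abs(gx[ys, xs]).mean(), np.abs(gy[ys, xs]).mean()]
    v = np.asarray(cells)
    return v / (np.linalg.norm(v) + 1e-8)  # unit norm -> comparable L2
```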
Further supporting this possibility, we recently used
scene matching methods similar to those presented here
to estimate the GPS location of an arbitrary image. In a
project called IM2GPS,10 we collect a database of 6 million geotagged photographs from Flickr and show that image matches for a query photo are often "similar enough" to be geographically informative even if we do not match the exact real-world location. We represent the estimated image
location as a probability distribution over the Earth’s surface (see Figure 9). We quantitatively evaluate our approach
in several geolocation tasks and demonstrate encouraging
performance (up to 30 times better than chance). We show
that geolocation estimates can provide the basis for numerous other image understanding tasks such as population
density estimation, land cover estimation or urban/rural
classification (see10 for details).
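A rough sketch of this geolocation idea follows, under a few assumptions: per-image descriptors and latitude/longitude tags are precomputed, matching is plain L2 nearest neighbors over the top 120 matches as in Figure 9, and mean-shift clustering of the neighbors' GPS tags is one plausible way (not necessarily the paper's exact procedure) to pick the densest mode of the distribution.

```python
import numpy as np
from sklearn.cluster import MeanShift

def estimate_gps(query_desc, db_descs, db_latlon, k=120):
    """Estimate a photo's (lat, lon) from its nearest scene matches.

    db_latlon: (n, 2) latitude/longitude tags of the database images.
    Returns the center of the largest geographic cluster among the
    k nearest neighbors, i.e. a mode of the estimated distribution
    over the Earth's surface.
    """
    # 1. Scene matching: k nearest database images in descriptor space.
    dists = np.linalg.norm(db_descs - query_desc, axis=1)
    nn_latlon = db_latlon[np.argsort(dists)[:k]]

    # 2. Cluster the matches' GPS tags.  A ~2 degree bandwidth is a
    #    guess at "same region"; Euclidean distance on lat/lon ignores
    #    the sphere, which is tolerable for a coarse sketch.
    ms = MeanShift(bandwidth=2.0).fit(nn_latlon)
    biggest = np.bincount(ms.labels_).argmax()
    return ms.cluster_centers_[biggest]
```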
Figure 9: Geolocation estimates for photos of the Grand Canyon and a generic European alley. From left to right: the query photographs, the first 16 nearest scene matches, and the distribution of the top 120 nearest neighbors across the Earth. Geographic clusters are marked by x's with size proportional to cluster cardinality. The ground-truth locations of the queries are surrounded by concentric green circles.