DOI: 10.1145/1400181.1400202
Scene Completion Using Millions of Photographs
By James Hays and Alexei A. Efros
Abstract
What can you do with a million images? In this paper, we
present a new image completion algorithm powered by a
huge database of photographs gathered from the Web. The
algorithm patches up holes in images by finding similar image regions in the database that are not only seamless, but
also semantically valid. Our chief insight is that while the
space of images is effectively infinite, the space of semantically differentiable scenes is actually not that large. For
many image completion tasks, we are able to find similar
scenes which contain image fragments that will convincingly complete the image. Our algorithm is entirely data-driven,
requiring no annotations or labeling by the user.
Unlike existing image completion methods, our algorithm
can generate a diverse set of image completions, and we allow users to select among them. We demonstrate the superiority of our algorithm over existing image completion
approaches.
1. Introduction
Every once in a while, we all wish we could erase something
from our old photographs. A garbage truck right in the middle of a charming Italian piazza, an ex-boyfriend in a family
photo, a political ally in a group portrait who has fallen out
of favor [13]. Other times, there is simply missing data in some
areas of the image: (a) an aged corner of an old photograph,
(b) a hole in an image-based 3D reconstruction due to occlusion, and (c) a dead bug on the camera lens. Image completion (also called inpainting or hole-filling) is the task of
filling in or replacing an image region with new image data
such that the modification cannot be detected.
There are two fundamentally different strategies for image completion. The first aims to reconstruct, as accurately
as possible, the data that should have been there, but somehow got occluded or corrupted. Methods attempting an accurate reconstruction have to use some other source of data
in addition to the input image (Figure 1), such as video (using
various background stabilization techniques) or multiple photographs of the same scene [1, 19].
The alternative is to try finding a plausible way to fill in
the missing pixels, hallucinating data that could have been
there. This is a much less easily quantifiable endeavor, relying instead on the studies of human visual perception.
The most successful existing methods [4, 6, 24, 25] operate by extending adjacent textures and contours into the unknown
region. These algorithms are similar to texture synthesis
algorithms [7, 8, 14, 15], sometimes with additional constraints to explicitly preserve Gestalt cues such as good continuation [23], either automatically [4] or by hand [20]. Importantly,
all of the existing image completion methods operate by
filling in the unknown region with content from the known
parts of the input source image.
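
To make the idea of filling a hole from the known parts of the same image concrete, the following is a toy sketch in plain Python. It is not the algorithm of any of the cited methods: the function name `fill_from_same_image`, the `radius` parameter, the greedy fill order, and the mean-squared-difference cost are all assumptions chosen for illustration. Each unknown pixel is filled by copying the value of the known pixel whose surrounding neighborhood best matches the already-known neighbors of the unknown pixel.

```python
# Toy exemplar-based hole filling on a small grayscale grid.
# All names and design choices here are illustrative assumptions,
# not the procedure of any cited image completion method.

def fill_from_same_image(image, hole, radius=1):
    """Return a copy of `image` with pixels marked True in `hole` filled in.

    image: list of rows of numbers; hole: same shape, True where unknown.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    known = [[not hole[y][x] for x in range(w)] for y in range(h)]
    # Candidate sources are pixels that were known in the original input.
    sources = [(y, x) for y in range(h) for x in range(w) if not hole[y][x]]
    if not sources:
        raise ValueError("the image has no known pixels to copy from")

    offsets = [(dy, dx)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)
               if (dy, dx) != (0, 0)]

    def cost(ty, tx, sy, sx):
        """Mean squared difference over neighbors usable at both sites."""
        total, count = 0.0, 0
        for dy, dx in offsets:
            ay, ax, by, bx = ty + dy, tx + dx, sy + dy, sx + dx
            if 0 <= ay < h and 0 <= ax < w and 0 <= by < h and 0 <= bx < w:
                if known[ay][ax] and not hole[by][bx]:
                    total += (out[ay][ax] - out[by][bx]) ** 2
                    count += 1
        return total / count if count else float("inf")

    def known_neighbors(y, x):
        return sum(1 for dy, dx in offsets
                   if 0 <= y + dy < h and 0 <= x + dx < w
                   and known[y + dy][x + dx])

    remaining = {(y, x) for y in range(h) for x in range(w) if hole[y][x]}
    while remaining:
        # Fill the unknown pixel with the most known neighbors first,
        # so the hole is closed from its boundary inward.
        ty, tx = max(remaining,
                     key=lambda p: (known_neighbors(p[0], p[1]), -p[0], -p[1]))
        sy, sx = min(sources, key=lambda s: cost(ty, tx, s[0], s[1]))
        out[ty][tx] = image[sy][sx]
        known[ty][tx] = True
        remaining.remove((ty, tx))
    return out


# Usage: vertical stripes with a two-pixel hole in the middle row.
stripes = [[0, 9, 0, 9, 0, 9] for _ in range(5)]
hole = [[False] * 6 for _ in range(5)]
hole[2][2] = hole[2][3] = True
damaged = [row[:] for row in stripes]
damaged[2][2] = damaged[2][3] = -1  # placeholder values inside the hole
filled = fill_from_same_image(damaged, hole)
print(filled[2])
```

Because every filled value is copied from elsewhere in the input, a sketch like this can only reproduce content that the image already contains, which is the limitation discussed next.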
Searching the source image for usable texture makes a
lot of sense. The source image often has textures at just the
right scale, orientation, and illumination as needed to seamlessly fill in the unknown region. Some methods [6, 25] search
additional scales and orientations to gain additional source
texture samples. However, viewing image completion as
constrained texture synthesis limits the type of completion
tasks that can be tackled. The assumption present in all of
these methods is that all the necessary image data to fill in
an unknown region is located somewhere else in that same
image. We believe this assumption is flawed and that the
source image simply does not provide enough data except
for trivial image completion tasks.
Typical demonstrations of previously published algorithms are object removal tasks such as removing people,
signs, horses, or cars from relatively simple backgrounds.
The results tend to be fairly sterile images because the algorithms are only reusing image content that appeared somewhere else in the same image. For situations in which the incomplete region is not bounded by texture regions, or when
Figure 1: Given an input image with a missing region, we use matching scenes from a large collection of photographs to complete the image. Panels: original image, input, scene matches, output.