Figure 2: Results from image completion algorithms, including Microsoft Digital Image Pro Smart Erase. Panels: original image; input; Criminisi et al. 2003; Smart Erase; Wilczkowiak et al.; our algorithm.
there is too little useful texture, existing algorithms have
trouble completing scenes (Figure 2).
2. OVERVIEW
In this paper, we perform image completions by leveraging
a massive database of images. There are two compelling
reasons to expand the search for image content beyond the
source image. (1) In many cases, a region will be impossible
to fill plausibly using only image data from the source image,
for instance, if the roof of a house is missing or the entire
sky is masked out. (2) Even if there is suitable content in the
source image, reusing that content would often leave obvious duplications in the final image, e.g., replacing a missing
building with another building in the image. By performing
hole filling with content from other images, entirely novel
objects and textures can be inserted.
However, there are several challenges with drawing content
from other images. The first challenge is computational. Even
in the single image case some existing methods report running
times on the order of hours [6, 8] because of the slow texture search. Texture synthesis-based image completion methods are difficult
to accelerate with traditional nearest-neighbor or approximate
nearest-neighbor methods because of the high dimensionality
of the features being searched and because the known dimensions of the feature being matched change with
the shape of the unknown region at each iteration.
The second challenge is that as the search space increases, there is a higher chance of a synthesis algorithm finding
locally matching but semantically invalid image fragments.
Existing image completion methods might produce sterile
images, but they do not risk putting an elephant in someone’s backyard or a submarine in a parking lot.
The third challenge is that content from other images is
much less likely to have the right color and illumination to
seamlessly fill an unknown region compared to content from
the same image. More than other image completion methods, we need a robust seam-finding and blending method to
make our image completions plausible.
In this work, we alleviate both the computational and
semantic challenges with a two-stage search. We first try to
find images depicting semantically similar scenes and then
use only the best matching scenes to find patches which
match the context surrounding the missing region. Scene
matching reduces our search from a database of one million
images to a manageable number of best matching scenes
(60 in our case), which are used for image completion. We
use a low-dimensional scene descriptor [16] so it is relatively
fast to find the nearest scenes, even in a large database. Our
approach is purely data driven, requiring no labeling or
supervision.
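
The two-stage search described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the squared L2 distance between descriptors, the masked sum of squared differences for context matching, and the simplification of one candidate patch per scene are all assumptions made for illustration. Only the overall structure (scene matching first, then context matching among the 60 best scenes) comes from the text.

```python
import numpy as np


def top_k_scenes(query_desc, db_descs, k=60):
    """Stage 1: indices of the k database descriptors nearest to the query.

    Squared L2 distance is an assumed metric, chosen for simplicity.
    """
    dists = np.sum((db_descs - query_desc) ** 2, axis=1)
    k = min(k, len(dists))
    nearest = np.argpartition(dists, k - 1)[:k]   # unordered k smallest
    return nearest[np.argsort(dists[nearest])]    # ordered by distance


def masked_ssd(patch, context, known):
    """Sum of squared differences over the known (unmasked) pixels only."""
    diff = (patch - context) * known
    return float(np.sum(diff ** 2))


def two_stage_search(query_desc, db_descs, context, known, patches, k=60):
    """Stage 2: among the k best scenes, pick the patch that best fits the
    context around the hole. `patches[i]` is a hypothetical single candidate
    patch taken from scene i.
    """
    scene_ids = top_k_scenes(query_desc, db_descs, k)
    costs = [masked_ssd(patches[i], context, known) for i in scene_ids]
    best = int(np.argmin(costs))
    return int(scene_ids[best]), costs[best]
```

The point of the design is that the expensive context comparison runs on only `k` scenes, while the cheap descriptor comparison runs on the whole database.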
In order to seamlessly combine image regions, we employ Poisson blending. To avoid blending artifacts, we first
perform a graph cut segmentation to find the boundary for
the Poisson blending that has the minimum image gradient
magnitude. This is in contrast to minimizing the intensity
domain difference along a boundary [25] or other heuristics
to encourage a constant intensity offset for the blending
boundary [11]. In Section 4, we explain why minimizing the
seam gradient gives the most perceptually convincing compositing results.
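
As a rough illustration of the blending step, the sketch below solves the discrete Poisson equation inside a fixed mask with Jacobi iterations on a grayscale image. It is a simplified stand-in, not the method of this paper: the graph cut that chooses the boundary is omitted (the mask is given by hand), and the function name, the iteration count, and the requirement that the mask not touch the image border are assumptions.

```python
import numpy as np


def poisson_blend(target, source, mask, iters=2000):
    """Blend `source` into `target` inside the boolean `mask`.

    Inside the mask the result reproduces the gradients of `source`;
    outside it the result equals `target`. Assumes 2D float-compatible
    arrays of equal shape and a mask that does not touch the border.
    """
    target = target.astype(float)
    source = source.astype(float)

    # Discrete Laplacian of the source (4-neighbour stencil).
    lap = np.zeros_like(source)
    lap[1:-1, 1:-1] = (4.0 * source[1:-1, 1:-1]
                       - source[:-2, 1:-1] - source[2:, 1:-1]
                       - source[1:-1, :-2] - source[1:-1, 2:])

    # Start from a plain copy-paste composite.
    result = np.where(mask, source, target)
    for _ in range(iters):
        neighbours = np.zeros_like(result)
        neighbours[1:-1, 1:-1] = (result[:-2, 1:-1] + result[2:, 1:-1]
                                  + result[1:-1, :-2] + result[1:-1, 2:])
        # Jacobi update: 4*f_p - sum(f_q) = lap_p inside the mask.
        updated = (neighbours + lap) / 4.0
        result = np.where(mask, updated, target)
    return result
```

A source that differs from the target only by a constant intensity offset blends back to the target exactly, because only gradients are copied. Differences that vary along the boundary are instead spread into the filled region, which is why the choice of seam matters.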
The image completion work most closely resembling our
own, Wilczkowiak et al. [25], also demonstrates the search of
multiple images. However, in their case only a few hand-selected images were used to offer potentially useful image
regions. Also related are methods that synthesize semantically valid images from either text or image constraints [5, 12].
These methods create semantically valid images through
explicit semantic constraints using image databases with
semantically labeled regions. The database labeling process
must be supervised [5] or semisupervised [12].
3. SEMANTIC SCENE MATCHING
Since we do not employ user-provided semantic constraints
or a labeled database, we need to acquire our semantic
knowledge from the data directly. This requires us to sample the set of visual images as broadly as possible. We constructed our image collection by downloading all of the photographs in 30 Flickr.com groups that focus on landscape,
travel, or city photography. Typical group names are “lonely
planet,” “urban fragments,” and “rural decay.” Photographs
in these groups are generally high quality. We also downloaded images based on keyword searches such as “travel,”
“landscape,” and “river.” We discarded all duplicate images
and all images that did not have at least 800 pixels in their
largest dimension and 500 pixels in their smallest dimension. All images were down-sampled, if necessary, such that
their maximum dimension was 1024 pixels. Our database
downloading, preprocessing, and scene matching are all