distributed among a cluster of 15 machines. In total we acquired about 2.3 million unique images totaling 396 gigabytes of JPEG compressed data.
In order to successfully complete images, we need to
find image regions in our database that are not just seamless and properly textured but also semantically valid. We do
not want to complete hillsides with car roofs or have ducks
swimming in city pavement which locally resembles a lake.
To help avoid such situations, we first look for scenes which
are most likely to be semantically equivalent to the image requiring completion. The use of scene matching is the most
important element of our image completion method. Without it, our search would not be computationally feasible and
our image completion results would rarely be semantically
valid. Our scene matching, in combination with our large
database, allows us to do image completion without resorting to explicit semantic constraints as in previous photo synthesis work [5, 12].
We use the gist scene descriptor [16, 22], which has been shown
to perform well at grouping semantically similar scenes
(e.g., city, tall buildings, office, fields, forest, and beach) and
for place recognition. Note, however, that a low-level scene descriptor by itself cannot encode high-level semantic information; scenes can only be trusted to be semantically similar if the distance between them is very small. We address this issue by collecting a huge dataset of images, making it more likely that a scene very similar to the query is available in the dataset. Indeed, our
initial experiments with the gist descriptor on a dataset of
ten thousand images were very discouraging. However, increasing the image collection to one million yielded a qualitative leap in performance (see Figure 3 for a typical scene
matching result). Independently, Torralba et al. [21] have observed a similar effect with a dataset of up to 70 million tiny (32 × 32) images.
The gist descriptor aggregates oriented edge responses
at multiple scales into very coarse spatial bins. We found a
gist descriptor built from six oriented edge responses at five
scales, aggregated to a 4 × 4 spatial resolution, to be the most
effective. We augment the scene descriptor with the color
information of the query image down-sampled to the spatial
resolution of the gist.
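To make this construction concrete, the following Python sketch builds a simplified gist-like descriptor. It substitutes gradient-orientation energy for the Gabor filter bank used by the actual gist descriptor, so it illustrates only the aggregation scheme (six orientations, five scales, 4 × 4 spatial bins); the function name and details are ours.

```python
import numpy as np
from scipy import ndimage

def gist_like_descriptor(gray, n_orient=6, n_scales=5, grid=4):
    """Illustrative gist-style descriptor: oriented edge energy at several
    scales, pooled into a coarse grid x grid spatial layout. A real gist
    implementation uses a Gabor filter bank rather than raw gradients."""
    feats = []
    img = gray.astype(np.float64)
    for _ in range(n_scales):
        gy, gx = np.gradient(img)
        mag = np.hypot(gx, gy)                    # edge strength
        ang = np.arctan2(gy, gx) % np.pi          # orientation in [0, pi)
        for k in range(n_orient):
            lo, hi = k * np.pi / n_orient, (k + 1) * np.pi / n_orient
            resp = mag * ((ang >= lo) & (ang < hi))  # energy in this band
            h, w = resp.shape
            bh, bw = h // grid, w // grid         # pool into grid x grid bins
            pooled = (resp[:bh * grid, :bw * grid]
                      .reshape(grid, bh, grid, bw).mean(axis=(1, 3)))
            feats.append(pooled.ravel())
        img = ndimage.zoom(img, 0.5)              # next coarser scale
    return np.concatenate(feats)                  # 6 x 5 x 16 = 480 dimensions
```

The color channels of the query, down-sampled to the same 4 × 4 grid, are then appended to this vector.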
Searching for similar scenes first rather than directly
looking for similar patches speeds up our search dramatically. Instead of looking for image patches in all one million
images at multiple offsets and scales, we quickly eliminate 99.99% of the database by finding the nearest
neighbor scenes based on the relatively low-dimensional
scene descriptor.
Given an input image to be hole-filled, we first compute
its gist descriptor with the missing regions excluded. This is
done by creating a mask which weights each spatial bin in
the gist in proportion to how many valid pixels are in that bin.
We compute the L1 distance between the gist of the query image and every gist in the database, weighted by the mask. The
color distance is computed in the L*a*b* color space. These
distances are combined and weighted such that the gist contributes roughly twice as much as the color information to the
final distance.
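A minimal sketch of this weighted comparison follows. The exact norm used for the color term and the precise weights are assumptions on our part; only the masked L1 gist distance and the roughly 2:1 gist-to-color weighting are stated above.

```python
import numpy as np

def scene_distance(q_gist, q_color, db_gists, db_colors, bin_mask,
                   gist_weight=2.0, color_weight=1.0):
    """Masked L1 gist distance plus L*a*b* color distance, combined so the
    gist term contributes roughly twice as much as the color term.
    db_gists is (N, D), db_colors is (N, C); bin_mask weights each gist
    dimension by the fraction of valid (non-hole) pixels in its bin."""
    d_gist = np.abs(db_gists - q_gist) @ bin_mask       # (N,) masked L1
    d_color = np.abs(db_colors - q_color).sum(axis=1)   # (N,) L1 in L*a*b*
    return gist_weight * d_gist + color_weight * d_color
```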
Because we match scenes using arbitrary dimensions
of the descriptor (depending on which regions of a query
image are missing), we cannot use principal components
analysis (PCA) dimensionality reduction as suggested in [16].
For the same reason, we do not use approximate nearest-neighbor schemes, since they tend to incur large relative errors
when matching on arbitrary dimensions of descriptors with
hundreds of dimensions. However, the descriptors are small
enough to exhaustively search in a few minutes using a cluster of 15 machines.
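A brute-force search of this kind is straightforward to shard across machines. The sketch below scores every descriptor in each shard and keeps the k best matches; the shard interface and function names are hypothetical.

```python
import numpy as np

def exhaustive_top_k(distance_fn, shards, k=200):
    """Brute-force scene search: score every database descriptor and keep
    the k nearest. Each shard is an (ids, gists, colors) tuple standing in
    for the portion of the database held by one machine in the cluster."""
    all_ids, all_d = [], []
    for ids, gists, colors in shards:
        all_d.append(distance_fn(gists, colors))   # e.g. scene_distance above
        all_ids.append(ids)
    d = np.concatenate(all_d)
    ids = np.concatenate(all_ids)
    best = np.argpartition(d, k)[:k]               # unordered k smallest
    return ids[best[np.argsort(d[best])]]          # sorted by distance
```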
4. Local Context Matching
Having constrained our search to semantically similar
scenes, we can use traditional template matching to more
precisely align those matching scenes to the local image
context around the missing region. The local context is all
pixels within an 80 pixel radius of the hole’s boundary. This
context is compared against the 200 best matching scenes
across all valid translations and three scales (0.81, 0.90, 1)
using pixel-wise SSD error in L*a*b* color space. Only placements (translations and scales) for which the context is fully
Figure 3: The first 164 nearest neighbor scenes for the incomplete image in the center. Most of the scenes are semantically and structurally similar; many are even from the same city (London).
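To illustrate the placement search described above, the sketch below slides the local context over one candidate scene at each of the three scales and scores every fully contained placement with masked, pixel-wise SSD in L*a*b* space. It is a direct, unoptimized loop (a practical implementation would use FFT-based correlation), and the helper names are ours.

```python
import numpy as np
from scipy import ndimage

def context_placements(context_lab, ctx_mask, scene_lab,
                       scales=(0.81, 0.90, 1.0)):
    """Score fully contained placements of the local context within one
    candidate scene at three scales, using masked pixel-wise SSD in L*a*b*.
    ctx_mask is 1 on context pixels (within 80 px of the hole), 0 elsewhere."""
    placements = []
    for s in scales:
        scaled = ndimage.zoom(scene_lab, (s, s, 1))    # rescale the scene
        H, W, _ = scaled.shape
        h, w, _ = context_lab.shape
        for y in range(H - h + 1):                     # brute force; FFT is faster
            for x in range(W - w + 1):
                diff = scaled[y:y + h, x:x + w] - context_lab
                ssd = (diff ** 2 * ctx_mask[..., None]).sum()
                placements.append((ssd, s, y, x))
    placements.sort()                                  # lowest SSD first
    return placements
```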