DOI: 10.1145/1400181.1400202

Scene Completion Using Millions of Photographs

By James Hays and Alexei A. Efros

Abstract

What can you do with a million images? In this paper, we present a new image completion algorithm powered by a huge database of photographs gathered from the Web. The algorithm patches up holes in images by finding similar image regions in the database that are not only seamless, but also semantically valid. Our chief insight is that while the space of images is effectively infinite, the space of semantically differentiable scenes is actually not that large. For many image completion tasks, we are able to find similar scenes which contain image fragments that will convincingly complete the image. Our algorithm is entirely data driven, requiring no annotations or labeling by the user. Unlike existing image completion methods, our algorithm can generate a diverse set of image completions and we allow users to select among them. We demonstrate the superiority of our algorithm over existing image completion approaches.

1. Introduction

Every once in a while, we all wish we could erase something from our old photographs: a garbage truck right in the middle of a charming Italian piazza, an ex-boyfriend in a family photo, a political ally in a group portrait who has fallen out of favor.13 Other times, there is simply missing data in some areas of the image: (a) an aged corner of an old photograph, (b) a hole in an image-based 3D reconstruction due to occlusion, or (c) a dead bug on the camera lens. Image completion (also called inpainting or hole-filling) is the task of filling in or replacing an image region with new image data such that the modification cannot be detected.

There are two fundamentally different strategies for image completion. The first aims to reconstruct, as accurately as possible, the data that should have been there, but somehow got occluded or corrupted. Methods attempting an accurate reconstruction have to use some other source of data in addition to the input image (Figure 1), such as video (using various background stabilization techniques) or multiple photographs of the same scene.1, 19

The alternative is to try finding a plausible way to fill in the missing pixels, hallucinating data that could have been there. This is a much less easily quantifiable endeavor, relying instead on the studies of human visual perception. The most successful existing methods4, 6, 24, 25 operate by extending adjacent textures and contours into the unknown region. These algorithms are similar to texture synthesis algorithms,7, 8, 14, 15 sometimes with additional constraints to explicitly preserve Gestalt cues such as good continuation,23 either automatically4 or by hand.20 Importantly, all of the existing image completion methods operate by filling in the unknown region with content from the known parts of the input source image.

Searching the source image for usable texture makes a lot of sense. The source image often has textures at just the right scale, orientation, and illumination as needed to seamlessly fill in the unknown region. Some methods6, 25 search additional scales and orientations to gain additional source texture samples. However, viewing image completion as constrained texture synthesis limits the type of completion tasks that can be tackled. The assumption present in all of these methods is that all the necessary image data to fill in an unknown region is located somewhere else in that same image. We believe this assumption is flawed and that the source image simply does not provide enough data except for trivial image completion tasks.
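To make the single-image assumption concrete, the following is a minimal sketch of searching the source image for usable texture. It is not the procedure of any of the cited methods: the names `best_source_patch` and `fill_patch`, the grayscale list-of-lists image, the sum-of-squared-differences cost, and the square patch shape are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Image = List[List[float]]  # grayscale image, row-major (illustrative format)
Mask = List[List[bool]]    # True where the pixel is known


def best_source_patch(image: Image, known: Mask, top: int, left: int,
                      size: int) -> Optional[Tuple[int, int]]:
    """Return (row, col) of the fully known patch that best matches the
    known pixels of the target patch at (top, left), or None if none exists."""
    height, width = len(image), len(image[0])
    offsets = [(dy, dx) for dy in range(size) for dx in range(size)]
    # Only the known pixels of the target patch can be compared.
    target_known = [(dy, dx) for dy, dx in offsets if known[top + dy][left + dx]]

    best_cost = None
    best_pos = None
    for y in range(height - size + 1):
        for x in range(width - size + 1):
            # A source patch must lie entirely inside the known region.
            if not all(known[y + dy][x + dx] for dy, dx in offsets):
                continue
            # Sum of squared differences over the comparable pixels.
            cost = sum((image[y + dy][x + dx] - image[top + dy][left + dx]) ** 2
                       for dy, dx in target_known)
            if best_cost is None or cost < best_cost:
                best_cost, best_pos = cost, (y, x)
    return best_pos


def fill_patch(image: Image, known: Mask, top: int, left: int, size: int) -> bool:
    """Copy the best source patch into the unknown pixels of the target patch.
    Modifies image and known in place; returns False if no source patch exists."""
    src = best_source_patch(image, known, top, left, size)
    if src is None:
        return False
    for dy in range(size):
        for dx in range(size):
            if not known[top + dy][left + dx]:
                image[top + dy][left + dx] = image[src[0] + dy][src[1] + dx]
                known[top + dy][left + dx] = True
    return True
```

Every candidate in this sketch comes from the input image itself, so the fill can only repeat content that is already visible elsewhere in the picture. That is the limitation the paragraph above describes.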

Typical demonstrations of previously published algorithms are object removal tasks such as removing people, signs, horses, or cars from relatively simple backgrounds. The results tend to be fairly sterile images because the algorithms are only reusing image content that appeared somewhere else in the same image. For situations in which the incomplete region is not bounded by texture regions, or when

Figure 1: Given an input image with a missing region, we use matching scenes from a large collection of photographs to complete the image. Panels: original image, input, scene matches, output.
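The pipeline in Figure 1 (input, scene matches, output) can be sketched roughly as follows. This is an illustration under loud assumptions, not the authors' implementation: the raw-pixel squared distance, the direct copy of pixels into the hole, and the names `masked_distance`, `rank_scenes`, and `complete` are placeholders, since this section does not specify how scenes are compared or how fragments are merged.

```python
from typing import List

Image = List[List[float]]  # grayscale image, row-major (illustrative format)
Mask = List[List[bool]]    # True where the query pixel is known


def masked_distance(query: Image, candidate: Image, known: Mask) -> float:
    """Squared difference between two same-sized images over known pixels only."""
    total = 0.0
    for q_row, c_row, k_row in zip(query, candidate, known):
        for q, c, k in zip(q_row, c_row, k_row):
            if k:
                total += (q - c) ** 2
    return total


def rank_scenes(query: Image, known: Mask, database: List[Image], k: int) -> List[int]:
    """Return indices of the k database images closest to the query."""
    order = sorted(range(len(database)),
                   key=lambda i: masked_distance(query, database[i], known))
    return order[:k]


def complete(query: Image, known: Mask, scene: Image) -> Image:
    """Keep known query pixels and take the missing ones from the matched scene.
    (Placeholder compositing: a plain copy with no seam handling.)"""
    return [[q if k else s for q, s, k in zip(q_row, s_row, k_row)]
            for q_row, s_row, k_row in zip(query, scene, known)]


def candidate_completions(query: Image, known: Mask,
                          database: List[Image], k: int) -> List[Image]:
    """Produce several alternative completions for a user to choose among."""
    return [complete(query, known, database[i])
            for i in rank_scenes(query, known, database, k)]
```

Returning several ranked candidates rather than a single result mirrors the abstract's point that the user selects among a diverse set of completions.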
