Building Rome in a Day
By Sameer Agarwal,(a) Yasutaka Furukawa,(a) Noah Snavely, Ian Simon,(b) Brian Curless, Steven M. Seitz, and Richard Szeliski
We present a system that can reconstruct 3D geometry
from large, unorganized collections of photographs such
as those found by searching for a given city (e.g., Rome) on
Internet photo-sharing sites. Our system is built on a set
of new, distributed computer vision algorithms for image
matching and 3D reconstruction, designed to maximize
parallelism at each stage of the pipeline and to scale gracefully with both the size of the problem and the amount of
available computation. Our experimental results demonstrate that it is now possible to reconstruct city-scale image
collections with more than a hundred thousand images in
less than a day.
Amateur photography was once largely a personal endeavor.
Traditionally, a photographer would capture a moment on
film and share it with a small number of friends and family
members, perhaps storing a few hundred prints in a shoebox. The advent of digital photography, and the recent growth
of photo-sharing Web sites such as Flickr.com, have brought
about a seismic change in photography and the use of photo
collections. Today, a photograph shared online can potentially be seen by millions of people.
As a result, we now have access to a vast, ever-growing
collection of photographs from around the world, capturing its cities and landmarks innumerable times. For instance, a
search for the term “Rome” on Flickr returns nearly 3 million photographs. This collection represents an increasingly
complete photographic record of the city, capturing every
popular site, façade, interior, fountain, sculpture, painting, and café. Virtually anything that people find interesting
in Rome has been captured from thousands of viewpoints
and under myriad illumination and weather conditions. For
example, the Trevi Fountain appears in over 50,000 of these
photographs.

How much of the city of Rome can be reconstructed in
3D from this photo collection? In principle, the photos of
Rome on Flickr represent an ideal data set for 3D modeling
research, as they capture the highlights of the city in exquisite detail and from a broad range of viewpoints. However,
extracting high quality 3D models from such a collection is
challenging for several reasons. First, the photos are
unstructured—they are taken in no particular order and we have no
control over the distribution of camera viewpoints. Second,
they are uncalibrated—the photos are taken by thousands
of different photographers and we know very little about
the camera settings. Third, the scale of the problem is
enormous—whereas prior methods operated on hundreds
or at most a few thousand photos, we seek to handle collections two to three orders of magnitude larger. Fourth, the
algorithms must be fast—we seek to reconstruct an entire
city in a single day, making it possible to repeat the process
many times to reconstruct all of the world’s significant cultural centers.

(a) This work was done while the author was a postdoctoral researcher at the University of Washington.
(b) Part of this work was done while the author was a graduate student at the University of Washington.
Creating accurate 3D models of cities is a problem of
great interest and with broad applications. In the government sector, city models are vital for urban planning and
visualization. They are equally important for a broad range
of academic disciplines including history, archeology, geography, and computer graphics research. Digital city models
are also central to popular consumer mapping and visualization applications such as Google Earth and Bing Maps,
as well as GPS-enabled navigation systems. In the near
future, these models can enable augmented reality capabilities that recognize and annotate objects on a camera
phone (or other) display. Such capabilities will allow tourists to find points of interest, get driving directions, and orient
themselves in a new environment.
City-scale 3D reconstruction has been explored previously.2, 8, 15, 21 However, existing large-scale systems operate
on data that comes from a structured source, e.g., aerial
photographs taken by a survey aircraft or street side imagery
captured by a moving vehicle. These systems rely on photographs captured using the same calibrated camera(s) at a
regular sampling rate and typically leverage other sensors
such as GPS and Inertial Navigation Units, vastly simplifying computation. Images harvested from the Web have none
of these simplifying characteristics. Thus, a key focus of our
work has been to develop new 3D computer vision techniques that work “in the wild,” on extremely diverse, large,
and unconstrained image collections.
Our approach to this problem builds on progress made
in computer vision in recent years (including our own recent
work on Photo Tourism18 and Photosynth), and draws from
many other areas of computer science, including distributed systems, algorithms, information retrieval, and scientific computing.
2. Structure from Motion
How can we recover 3D geometry from a collection of
images? A fundamental challenge is that a photograph is a
two-dimensional projection of a three-dimensional world.
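To make the loss of depth concrete, the following is a minimal pinhole-camera sketch (not the paper's code; the intrinsic matrix values and point coordinates are illustrative assumptions). Every 3D point along a ray through the camera center projects to the same pixel, so a single image cannot tell us how far away a point is:

```python
import numpy as np

# Assumed intrinsics: focal length 800 px, principal point (320, 240).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)      # camera at the origin, looking down +z
t = np.zeros(3)

def project(X):
    """Project a 3D point X (in world coordinates) to pixel coordinates."""
    p = K @ (R @ X + t)
    return p[:2] / p[2]   # the perspective divide discards depth

near = np.array([1.0, 0.5, 2.0])
far = 3.0 * near          # a different point on the same viewing ray

print(project(near))      # [720. 440.]
print(project(far))       # [720. 440.] -- identical pixel, depth is lost
```

Structure from motion resolves this ambiguity by using many images: the same scene point seen from two or more known viewpoints can be triangulated back to a unique 3D location.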
Inverting this projection is difficult as we have lost the depth
of each point in the image. As humans, we can experience
this problem by closing one eye, and noting our diminished
The original version of this paper was published in the
Proceedings of the 2009 IEEE International Conference on