a certain threshold, determined by the memory limitations of the machines.
The resulting clustering problem is a constrained discrete optimization problem (see Furukawa et al.9 for algorithmic details).
After the clustering, we solve for scene geometry within each cluster independently using an MVS algorithm, and then combine the results.9 This strategy not only makes it possible to perform the reconstruction, but also makes it straightforward to do so in parallel on many machines.
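As a rough illustration of this divide-and-conquer strategy, the following is a minimal sketch in Python, assuming a hypothetical run_mvs solver and a precomputed list of clusters; it is not the system's actual implementation (see Furukawa et al.9 for that).

from multiprocessing import Pool

def run_mvs(cluster):
    # Placeholder for a multi-view stereo solver run on one image
    # cluster; returns the surface points reconstructed from it.
    return []

def reconstruct(clusters, workers=8):
    # Each cluster fits within the memory threshold on its own, so the
    # clusters can be solved independently -- here on local processes,
    # in the real system on many machines.
    with Pool(workers) as pool:
        per_cluster_points = pool.map(run_mvs, clusters)
    # Combine the per-cluster point sets into a single model.
    return [p for points in per_cluster_points for p in points]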
We report the results of running our system on three city-scale data sets downloaded from Flickr: Dubrovnik, Rome, and Venice.
The SfM experiments were run on a cluster of 62 nodes
with dual quad-core processors, on a private network with
1GB/s Ethernet interfaces. Each node had 32GB of RAM and
1TB of local hard disk space with the Microsoft Windows
Server 2008 64-bit operating system. For encoding the images as TFIDF vectors, we used a set of visual words created from 20,000 images of Rome. The images used to create the visual word vocabulary were not used in any of the experiments.
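For concreteness, a minimal sketch of the TFIDF encoding over visual-word histograms follows; the vocabulary itself (built offline from the training images) is assumed given, and all names are illustrative.

import math
from collections import Counter

def tfidf_vectors(images_as_words):
    # images_as_words: one list of visual-word ids per image, obtained
    # by quantizing each image's features against the vocabulary.
    n = len(images_as_words)
    # Document frequency: the number of images each word occurs in.
    df = Counter(w for words in images_as_words for w in set(words))
    idf = {w: math.log(n / df[w]) for w in df}
    vectors = []
    for words in images_as_words:
        tf = Counter(words)
        total = sum(tf.values())
        vec = {w: (c / total) * idf[w] for w, c in tf.items()}
        # L2-normalize so whole-image similarity is a simple dot product.
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({w: v / norm for w, v in vec.items()})
    return vectors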
Figure 4 shows reconstructions of the largest connected components of these data sets. Due to space considerations, only a sample of the results is shown here. Complete results are posted at http://grail.cs.washington.
For whole-image similarity proposals, the top k1 = 10 were used in the first verification stage, and the next k2 = 10 were used in the second, component-matching stage. Four rounds of query expansion were performed; in all cases, the ratio of the number of matches performed to the number of matches verified starts dropping off after four rounds. Table 1 summarizes statistics of the three data sets.
The SfM timing numbers in Table 1 bear some explanation. It is surprising that running SfM on Dubrovnik took so much more time than Rome and almost as long as Venice, both of which are much larger data sets.
The reason lies in the structure of the data sets. The Rome and Venice sets are essentially collections of landmarks, most of which have a simple geometry and visibility structure. The largest connected component in Dubrovnik, on the other hand, captures the entire old city; with its complex visibility and widely varying viewpoints, reconstructing Dubrovnik is a much more complicated SfM problem.
This is reflected in the sizes of the skeletal sets associated
with the largest connected components shown in Table 2.
Figure 4 also shows the results of running our MVS system9 on the city-scale reconstructions produced by our matching and SfM system: MVS reconstructions (rendered as colored points) of St. Peter's Basilica (Rome), the Colosseum (Rome), Dubrovnik, and San Marco Square (Venice). Table 3 provides timing and size statistics.
The largest data set, San Marco Square, contains 14,000 input images, which were processed into 67 clusters and yielded 28 million surface points in less than 3h. While our system successfully reconstructs dense, high-quality 3D points for these very large scenes, our models contain holes in certain places, for example, on rooftops, where image coverage is poor, and on ground planes, where surfaces are usually not clearly visible. On the other hand, in places with many images, the reconstruction quality is very high, as illustrated in the close-ups in Figure 4.
A search on Flickr.com for the keywords "Rome" or "Roma" returns over 4 million images. Our aim was to reconstruct as much of the city as possible from these photographs in 24h. Our current system is about an order of magnitude away from this goal. Since the original publication of this work, Frahm et al. have built a system that uses the massive parallelism of GPUs to perform city-scale reconstructions on a single workstation.
In our system, the track generation, skeletal sets, and reconstruction algorithms all operate at the level of connected components, which means that the largest few components completely dominate these stages. We are currently exploring ways of parallelizing all three of these steps, with particular emphasis on the SfM system.
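The skew is easy to quantify: compute the connected components of the verified match graph (union-find suffices, as in the sketch below) and examine their sizes; the running time of the per-component stages is then governed by the largest entries of this list.

from collections import Counter

def component_sizes(num_images, edges):
    # Union-find with path halving over the verified match edges.
    parent = list(range(num_images))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    sizes = Counter(find(x) for x in range(num_images))
    return sorted(sizes.values(), reverse=True)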
Another issue with the current system is that it produces a set of disconnected reconstructions. If the images come with geotags or GPS information, our system can try to geo-locate the reconstructions. However, this information is frequently incorrect, noisy, or missing.
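One plausible way to geo-locate a reconstruction despite such noise, assuming at least some geotags are roughly correct, is to fit a 2D similarity transform from reconstructed camera positions to geotag coordinates inside a RANSAC loop. The sketch below illustrates that idea; it is not the system's actual method.

import random

def georegister(cam_xy, geo_xy, iters=500, tol=10.0):
    # cam_xy: reconstructed camera positions; geo_xy: corresponding
    # geotags, both as (x, y) pairs in a common planar projection.
    # A 2D similarity is z -> a*z + b over the complex numbers.
    cams = [complex(x, y) for x, y in cam_xy]
    geos = [complex(x, y) for x, y in geo_xy]
    best, best_inliers = None, -1
    for _ in range(iters):
        i, j = random.sample(range(len(cams)), 2)
        if cams[i] == cams[j]:
            continue
        a = (geos[i] - geos[j]) / (cams[i] - cams[j])
        b = geos[i] - a * cams[i]
        # Count geotags consistent with this transform within tol units.
        inliers = sum(abs(a * c + b - g) < tol
                      for c, g in zip(cams, geos))
        if inliers > best_inliers:
            best, best_inliers = (a, b), inliers
    return best, best_inliers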
The runtime performance of the matching system
depends critically on how well the verification jobs are
distributed across the network. This is facilitated by
the initial distribution of the images across the cluster
nodes. An early decision to store images according to the
name of the user and the Flickr ID of the image meant
that most images taken by the same user ended up on the
same cluster node. Looking at the match graph, it turns
out (quite naturally in hindsight) that a user’s own photographs have a high probability of matching amongst
themselves. The ID of the person who took the photograph is just one kind of metadata associated with these images. A more sophisticated strategy would exploit all the textual tags and geotags associated with the images to predict which images are likely to match, and distribute the data accordingly.
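The locality exploited here amounts to a simple, deterministic placement rule, sketched below with illustrative names; a metadata-aware scheme would replace the hash with a predictor over tags and geotags.

import hashlib

def node_for_image(owner_id, num_nodes):
    # Deterministically map a Flickr user id to a cluster node, so all
    # of a user's photos (which match each other with high probability)
    # are stored, and mostly verified, on the same node.
    digest = hashlib.md5(owner_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes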
Finally, our system is designed with batch operation in mind. A more challenging problem is to make the system incremental, so that new images can be matched and added to an existing reconstruction as they become available.
This work was supported in part by SPAWAR, NSF grant
IIS-0811878, the Office of Naval Research, the University of
Washington Animation Research Labs, and Microsoft. We
thank Microsoft Research for generously providing access
to their HPC cluster and Szymon Rusinkiewicz for the QSplat software. The authors would also like to acknowledge discussions with Steven Gribble, Aaron Kimball, Drew Steedly, and David Nister.