4. 1. Whole image similarity
A natural idea is to come up with a compact representation
for computing the overall similarity of two images, then use
this metric to propose edges to test.
For text documents, there are many techniques for
quickly comparing the content of two documents. One common method is to represent each document as a vector of
weighted word frequencies11; the distance between two such
vectors is a good predictor of the similarity between the corresponding documents.
Inspired by this work in document analysis, computer
vision researchers have recently begun to apply similar
techniques to visual object recognition with great success.
5, 14, 16, 17 The basic idea is to take the SIFT features in a
collection of photos and cluster them into “visual words.”
By treating the images as documents consisting of these
visual words, we can apply the machinery of document
retrieval to efficiently match large data sets of photos. We
use a fast tree-structured method for associating visual
words with image features.
14 Each photo is represented as
a sparse histogram of visual word which we weight using
the well-known “Term Frequency Inverse Document
Frequency” (TFIDF) method11; we compare two such histograms by taking their inner product. For each image,
we determine the k1 + k2 most similar images, and verify
the top k1 of these. This forms the proposals for the first
round of matching.
At this stage, we have a sparsely connected match
graph. To derive the most comprehensive reconstruction
possible, we want a graph with as few connected compo-
nents as possible. To this end, we make further use of the
proposals from the whole image similarity to try to con-
nect the various connected components in this graph. For
each image, we consider the next k2 images suggested by
the whole image similarity and verify those pairs which
straddle two different connected components. We do
this only for images which are in components of size two
or more.c
Despite their scale invariance and robustness to appear-
ance changes, SIFT features are local and do not contain
any global information about the image or about the loca-
tion of other features in the image. Thus feature matching
based on SIFT features is still prone to errors. However,
since we assume that we are dealing with rigid scenes,
there are strong geometric constraints on the locations of
the matching features and these constraints can be used to
clean up the matches. In particular, when a rigid scene is
imaged by two pinhole cameras, there exists a 3 × 3 matrix
F, the Fundamental matrix, such that corresponding points
xij and xik (represented in homogeneous coordinates) in two
images j and k satisfy10:
A common way to impose this constraint is to use a greedy
randomized algorithm to generate suitably chosen ran-
dom estimates of F and choose the one that has the larg-
est support among the matches, i.e., the one for which the
most matches satisfy ( 3). This algorithm is called Random
Sample Consensus (RANSAC)
6 and is used in many com-
puter vision problems.
4. cit Y-scaLe matchinG
Section 3 describes how to find correspondences between
a pair of images. However, given a large collection with
tens or hundreds of thousands of images, our task is to
find correspondences spanning the entire collection. One
way to think about this image matching problem is as a
graph estimation problem where we are given a set of vertices corresponding to the images and we want to discover
the set of edges connecting them. In this graph an edge
connects a pair of images if and only if they are looking at
the same part of the scene and have a sufficient number of
feature matches between them. We will call this graph the
match graph.
A naive way to determine the set of edges in the match
graph is to perform all O(n2) image matches; for large collections, however, this is not practical. For a set of 100,000
images, this translates into 5,000,000,000 pairwise comparisons, which with 500 cores operating at 10 image
pairs per second per core would require about 11. 5 days
to match, plus all of the time required to transfer the
image and feature data between machines. Further, even
if we were able to do all these pairwise matches, it would
be a waste of computational effort since an overwhelming
majority of the image pairs do not match, i.e., the graph is
sparse. We expect this to be the case for images from an
entire city.
Instead, our system uses a multiround scheme: in
each round we propose a set of edges in the match graph,
then verify each edge through feature matching. If we
find more than a minimum number of features, we keep
the edge; otherwise we discard it. Thus, the problem
reduces to that of formulating a method for quickly predicting when two images match. We use two methods to generate proposals: whole image similarity and query expansion.
4. 2. Query expansion
After performing the two rounds of matching based on
whole image similarity, we have a sparse match graph, but
this graph is usually not dense enough to reliably produce
a good reconstruction. To remedy this, we use another
idea from text and document retrieval research—query
expansion.
5
In its original form, query expansion takes a set of documents that match a user’s query, then queries again with
these initial results, expanding the initial query. The final
results are a combination of these two queries. If we define
a graph on the set of documents (including the query), with
similar documents connected by an edge, then query expansion is equivalent to finding all vertices that are within two
steps of the query vertex.
In our system, for every vertex j in the match graph, if
vertices i and k are connected to j, we propose that i and k are
also connected, and verify the edge (i, k). This process can be
c We use k1 = k2 = 10 in all our experiments.